Amendments to the Drawings; 

Attached hereto are sixty five (65) replacement sheets of drawings, which 
include Figs. 1 to 69. The attached replacement sheets replace all original drawing sheets. 
No new matter has been added. 

Attachment: sixty five (65) Replacement Sheets 
REMARKS

Claims 1 to 7 of the above-identified parent case have been canceled, without
prejudice. New claims 8 and 15 have been added and are pending herewith for examination
in the above-referenced continuation application. The amendments made do not add new
matter to the application.

In accordance with 37 C.F.R. § 1.121(b)(3), the Substitute Specification 
(including the Abstract, but without the claims) contains no new matter. The amendments 
reflected in the Substitute Specification (including Abstract) are to conform the Specification 
and Abstract to U.S. Patent and Trademark Office rules or to correct informalities. As 
required by 37 C.F.R. §§ 1.121(b)(3)(iii) and 1.125(b)(2), a Marked Up Version of the
Substitute Specification comparing the Specification of record and the Substitute
Specification also accompanies this Preliminary Amendment. Approval and entry of the
Substitute Specification (including Abstract) is respectfully requested.

The amendments reflected in the Drawings are to correct informalities. 
Approval and entry of the amendments to the Drawings are respectfully requested. 

It is respectfully submitted that the present invention is new, non-obvious, and
useful. Prompt consideration and allowance of these claims is therefore respectfully
requested.

Respectfully submitted,



KENYON & KENYON

Reg. No. 36,098

One Broadway
New York, New York 10004
Telephone: (212) 425-7200
Facsimile: (212) 425-5288
CUSTOMER NO. 2664
DATA PROCESSING METHOD AND DEVICE

FIELD OF THE INVENTION

The present invention relates to methods of operating and optimum use of reconfigurable 
arrays of data processing elements. 

BACKGROUND INFORMATION 

The limitations of conventional processors are becoming more and more evident. The
growing importance of stream-based applications makes coarse-grain dynamically
reconfigurable architectures an attractive alternative. See, e.g., R. Hartenstein, R. Kress, &
H. Reinig, "A new FPGA architecture for word-oriented datapaths," Proc. FPL'94, Springer
LNCS, September 1994, at 849; E. Waingold et al., "Baring it all to software: Raw
machines," IEEE Computer, September 1997, at 86-93; PACT Corporation, "The XPP
Communication System," Technical Report 15 (2000); see generally
http://www.pactcorp.com. They combine the performance of ASICs, which are very risky
and expensive (development and mask costs), with the flexibility of traditional processors.
See, for example, J. Becker, "Configurable Systems-on-Chip (CSoC)," (Invited Tutorial),
Proc. of 9th Proc. of XV Brazilian Symposium on Integrated Circuit Design (SBCCI 2002),
(September 2002).

The data paths of modern microprocessors reach their limits by using static instruction sets.
In spite of the possibilities that exist today in VLSI development, the basic concepts of
microprocessor architectures are the same as 20 years ago. The main processing unit of
modern conventional microprocessors, the datapath, in its actual structure follows the same
style guidelines as its predecessors. Although the development of pipelined architectures or
superscalar concepts in combination with data and instruction caches increases the
performance of a modern microprocessor and allows higher frequency rates, the main
concept of a static datapath remains. Therefore, each operation is a composition of basic
instructions that the used processor owns. The benefit of the processor concept lies in the
ability to execute strongly control-dominant applications. Data or stream oriented applications
are not well suited for this environment. The sequential instruction execution is not the right
target for that kind of application and needs high bandwidth because of permanent
retransmitting of instruction/data from and to memory. This handicap is often eased by use
of caches in various stages. A sequential interconnection of filters, which perform data
manipulation without writing back the intermediate results, would get the right optimisation
and reduction of bandwidth. Practically, this kind of chain of filters should be constructed in
a logical way and configured during runtime. Existing approaches to extend instruction sets
use static modules, not modifiable during runtime.
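The idea of a filter chain that passes intermediate results directly from stage to stage, and that can be assembled at run time, may be sketched in software as follows. This is only an illustrative analogy under stated assumptions: the names `scale`, `offset`, `clamp`, and `run_chain` are hypothetical and do not correspond to any XPP or NML construct.

```python
from typing import Callable, Iterable, Iterator, List

# A filter consumes a stream and produces a stream; nothing is written back
# to an intermediate buffer between stages.
Filter = Callable[[Iterator[int]], Iterator[int]]


def scale(factor: int) -> Filter:
    """Hypothetical filter stage: multiply every element by `factor`."""
    def stage(stream: Iterator[int]) -> Iterator[int]:
        for x in stream:
            yield x * factor
    return stage


def offset(delta: int) -> Filter:
    """Hypothetical filter stage: add `delta` to every element."""
    def stage(stream: Iterator[int]) -> Iterator[int]:
        for x in stream:
            yield x + delta
    return stage


def clamp(lo: int, hi: int) -> Filter:
    """Hypothetical filter stage: limit every element to the range [lo, hi]."""
    def stage(stream: Iterator[int]) -> Iterator[int]:
        for x in stream:
            yield max(lo, min(hi, x))
    return stage


def run_chain(filters: List[Filter], data: Iterable[int]) -> List[int]:
    """Connect the stages in order at run time and stream `data` through them."""
    stream: Iterator[int] = iter(data)
    for f in filters:
        stream = f(stream)  # each stage pulls directly from the previous one
    return list(stream)


if __name__ == "__main__":
    # The chain is chosen at run time; a different list gives a different datapath.
    print(run_chain([scale(2), offset(-3), clamp(0, 10)], [1, 4, 9]))
```

Passing a different list of stages to `run_chain` plays the role of loading a different configuration; each element still flows through all stages without an intermediate write-back.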

Customized microprocessors or ASICs are optimized for one special application
environment. It is nearly impossible to use the same microprocessor core for another
application without losing the performance gain of this architecture.

A new approach of a flexible and high performance datapath concept is needed, which allows
for reconfiguring the functionality and for making this core mainly application independent
without losing the performance needed for stream-based applications.

When using a reconfigurable array, it is desirable to optimize the way in which the array is
coupled to other units, e.g., to a processor if the array is used as a coprocessor. It is also
desirable to optimize the way in which the array is configured.

The present invention aims at providing improvements over the prior art.
It is to be noted that the disclosure of the present invention does comprise several major parts
in its description that all refer to ways of allowing for an optimum use of the array and hence
are closely related to each other.

It is also to be noted that the parts do comprise a plurality of figures that the text relates to,
however without always giving an exact, precise and correct reference. Yet any deviations
from correct referencing will be obvious to the average skilled person.

Further, WO 00/49496 discusses a method for execution of a computer program using a
processor that includes a configurable functional unit capable of executing reconfigurable
instructions, which can be redefined at runtime. A problem with conventional processor
architectures exists if a coupling of, for example, sequential processors is needed and/or
technologies such as data-streaming, hyper-threading, multi-threading, multi-tasking,
execution of parts of configurations, etc., are to be a useful way for enhancing performance.
Techniques discussed in prior art, such as WO 02/50665 A1, do not allow for a sufficiently
efficient way of providing for a data exchange between the ALU of a CPU and the
configurable data processing logic cell field, such as an FPGA, DSP, or other such
arrangement. In the prior art, the data exchange is effected via registers. In other words,
it is necessary to first write data into a register sequentially, then retrieve them sequentially,
and restore them sequentially as well.

NY01 1113556V1 MARKED-UP VERSION OF THE SUBSTITUTE SPECIFICATION

SUBSTITUTE SHEET (RULE 26) 

Table of Contents

First major part

Executive Summary

Hardware
  Code Optimizations
    Pipelining / Concurrency / Synchronicity
    Core frequency / Memory Hierarchy
    Software / Multitasking Operating Systems
  Communication Between the RISC Core and the XPP Core
    Streaming
    Shared Memory (Main Memory)
    Shared Memory (IRAM)
  State of the XPP Core
    Limiting Memory Traffic
  Context Switches
    SMT Virtual Processor Switch
    Interrupt Service Routine
    Task Switch
  A Load Store Architecture
    A Basic Implementation
    Implementation Improvements
    Support for Data Distribution and Data Reorganization
  Software / Hardware Interface
    Explicit Cache
  Another Implementation
    Programming Model Changes
    Hiding Implementation Details

Program Optimizations
  Code Analysis
    Data Flow Analysis
    Data Dependence Analysis
    Interprocedural Alias Analysis
    Interprocedural Value Range Analysis
    Alignment Analysis
  Code Optimizations
    General Transformations
    Loop Transformations
    Data Layout Optimizations
    Example of application of the optimizations

Compiler Specification for the PACT XPP
  Introduction
  Compiler Structure
    Code Preparation
    Partitioning
    RISC Code Generation and Scheduling
  XPP Compiler for Loops
    Temporal Partitioning
    Generation of NML Code
    Mapping Step
  XPP Loop Optimizations Driver
    Organization of the System
    Loop Preparation
    Transformation of the Data Dependence Graph
    Influence of the Architectural Parameters
    Optimizations Towards Hardware Improvements
Another problem exists if an external access to data is requested in known devices used, inter
alia, to implement functions in the configurable data processing logic cell field, DFP, FPGA,
etc. that cannot be processed sufficiently on a CPU-integrated ALU. Accordingly, the data
processing logic cell field is practically used to allow for user-defined opcodes that can
process data more efficiently than is possible on the ALU of the CPU without further support
by the data processing logic cell field. In the prior art, the coupling is generally word-based,
not block-based. A more efficient data processing, in particular more efficient than possible
with a close coupling via registers, is highly desirable.

Another method for the use of logic cell fields that include coarse- and/or fine-granular logic
cells and logic cell elements provides for a very loose coupling of such a field to a
conventional CPU and/or a CPU-core in embedded systems. In this regard, a conventional
sequential program can be executed on the CPU, for example a program written in C, C++,
etc., wherein the instantiation or the data stream processing by the fine- and/or coarse-granular
data processing logic cell field is effected via that sequential program. However, a
problem exists in that for programming said logic cell field, a program not written in C or
another sequential high-level language must be provided for the data stream processing. It is
desirable to allow for C-programs to run both on a conventional CPU-architecture as well as
on the data processing logic cell field operated therewith, in particular, despite the fact that a
quasi-sequential program execution should maintain the capability of data-streaming in the
data processing logic cell fields, whereas simultaneously the capability exists to operate the
CPU in a not too loosely coupled way.
It is already known to provide for sequential data processing within a data processing logic
cell field. See, for example, DE 196 51 075, WO 98/26356, DE 196 54 846, WO 98/29952,
DE 197 04 728, WO 98/35299, DE 199 26 538, WO 00/77652, and DE 102 12 621. Partial
execution is achieved within a single configuration, for example, to reduce the amount of
resources needed, to optimize the time of execution, etc. However, this does not lead
automatically to allowing a programmer to translate or transfer high-level language code
automatically onto a data processing logic cell field as is the case in common machine
models for sequential processes. The compilation, transfer, or translation of a high-level
language code onto data processing logic cell fields according to the methods known for
models of sequentially executing machines is difficult.

In the prior art, it is further known that configurations that effect different functions on parts
of the area respectively can be simultaneously executed on the processing array and that a
change of one or some of the configuration(s) without disturbing other configurations is
possible at run-time. Methods and hardware-implemented means for the implementation are
known to ensure that the execution of partial configurations to be loaded onto the array is
possible without deadlock. Reference is made to DE 196 54 593, WO 98/31102, DE 198 07
872, WO 99/44147, DE 199 26 538, WO 00/77652, DE 100 28 397, and WO 02/13000. This
technology allows in a certain way a certain parallelism and, given certain forms and
interrelations of the configurations or partial configurations for a certain way of
multitasking/multi-threading, in particular in such a way that the planning, i.e., the scheduling
and/or the planning control for time use, can be provided for. Furthermore, from the prior art,
time use planning control means and methods are known that, at least under a corresponding
interrelation of configurations and/or assignment of configurations to certain tasks and/or
threads to configurations and/or sequences of configurations, allow for a multi-tasking and/or
multi-threading.

SUMMARY OF THE INVENTION 

Embodiments of the present invention may improve upon the prior art with respect to
optimization of the way in which a reconfigurable array is coupled to other units and/or the
way in which the array is configured.
A way out of limitations of conventional microprocessors may be a dynamic reconfigurable
processor datapath extension achieved by integrating traditional static datapaths with the
coarse-grain dynamic reconfigurable XPP-architecture (eXtreme Processing Platform).
Embodiments of the present invention introduce a new concept of loosely coupled
implementation of the dynamic reconfigurable XPP architecture from PACT Corp. into a
static datapath of the SPARC compatible LEON processor. Thus, this approach is different
from those where the XPP operates as a completely separate (master) component within one
Configurable System-on-Chip (CSoC), together with a processor core, global/local memory
topologies, and efficient multi-layer Amba-bus interfaces. See, for example, J. Becker & M.
Vorbach, "Architecture, Memory and Interface Technology Integration of an
Industrial/Academic Configurable System-on-Chip (CSoC)," IEEE Computer Society Annual
Workshop on VLSI (WVLSI 2003), (February 2003). From the programmer's point of view,
the extended and adapted datapath may seem like a dynamic configurable instruction set. It
can be customized for a specific application and can accelerate the execution enormously.
Therefore, the programmer has to create a number of configurations that can be uploaded to
the XPP-Array at run time. For example, this configuration can be used like a filter to
calculate stream-oriented data. It is also possible to configure more than one function at the
same time and use them simultaneously. These embodiments may provide an enormous
performance boost and the needed flexibility and power reduction to perform a series of
applications very effectively.

Executive Summary

The study is concerned with three objectives:

Embodiments of the present invention may provide a hardware framework, which may enable
an efficient integration of a PACT XPP core into a standard RISC processor architecture.

Embodiments of the present invention may provide a compiler for a coupled RISC+XPP
hardware. The compiler may decide automatically which part of a source code is executed on
the RISC processor and which part is executed on the PACT XPP core.

In an example embodiment of the present invention, a C Compiler may be used in cooperation
with the hardware framework, for the integration of the PACT XPP core and RISC processor.

In an example embodiment of the present invention, the proposed hardware framework
may accelerate the XPP core in two respects. First, data throughput may be
increased by raising the XPP's internal operating frequency into the range of the RISC's
frequency. This, however, may cause the XPP to run into the same pit as
all high frequency processors, i.e., memory accesses may become very slow compared to
processor internal computations.
Accordingly, a cache may be provided for use. The cache may ease the memory access
problem for a large range of algorithms, which are well suited for an execution on the XPP.
The cache, as a second throughput increasing feature, may require a controller.
A programmable cache controller may be provided for
managing the cache contents and feeding the XPP core. It may decouple the
XPP core computations from the data transfer so that, for instance, data preload to a
specific cache sector may take place while the XPP is operating on data located in a
different cache sector.
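The decoupling of data transfer from computation described above resembles a double-buffering scheme, which may be sketched as follows. This is a sequential software model only: in hardware the preload and the computation would overlap in time. The function name `process_blocks`, the two-sector layout, and the trace format are assumptions made for illustration, not details taken from the specification.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Trace = List[Tuple[str, int, int]]  # (action, block index, sector index)


def process_blocks(
    blocks: Sequence[Sequence[int]],
    compute: Callable[[Sequence[int]], int],
) -> Tuple[List[int], Trace]:
    """Model a controller that preloads one sector while the other is in use.

    Two sectors are assumed. While `compute` works on the active sector, the
    next block is loaded into the other sector, so the sector being read is
    never the one being overwritten.
    """
    sectors: List[Optional[List[int]]] = [None, None]
    results: List[int] = []
    trace: Trace = []
    if not blocks:
        return results, trace

    sectors[0] = list(blocks[0])           # initial preload before any compute
    trace.append(("preload", 0, 0))

    for i in range(len(blocks)):
        active = i % 2
        idle = 1 - active
        if i + 1 < len(blocks):
            sectors[idle] = list(blocks[i + 1])   # would overlap with compute
            trace.append(("preload", i + 1, idle))
        results.append(compute(sectors[active]))  # works only on active sector
        trace.append(("compute", i, active))
    return results, trace


if __name__ == "__main__":
    out, log = process_blocks([[1, 2], [3, 4], [5]], sum)
    print(out)
    for entry in log:
        print(entry)
```

The trace makes the alternation visible: each preload targets the sector that is not being computed on in the same step.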

A problem which may emerge with a coupled RISC+XPP hardware
concerns the RISC's multitasking concept. It may
become necessary to interrupt computations on the XPP in order to perform a task switch.
Embodiments of the present invention may provide for hardware and a compiler that
support multitasking. First, each XPP configuration may be considered as an
uninterruptible entity. This means that the compiler, which generates the configurations,
may take care that the execution time of any configuration does not exceed a
predefined time slice. Second, the cache controller may be concerned with the saving and
restoring of the XPP's state after an interrupt. The proposed cache concept may
minimize the memory traffic for interrupt handling and frequently may even allow
avoiding memory accesses at all.
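One way to picture how a compiler might keep every configuration within a predefined time slice is to split a long loop into several runs, each short enough to finish before the slice expires. The sketch below is a simplified assumption: the cost model (a fixed setup cost plus a fixed cost per iteration) and the name `split_iterations` are hypothetical and are not taken from the specification.

```python
from typing import List, Tuple


def split_iterations(
    total_iterations: int,
    cycles_per_iteration: int,
    setup_cycles: int,
    time_slice_cycles: int,
) -> List[Tuple[int, int]]:
    """Split a loop into half-open iteration ranges that each fit one time slice.

    Assumed cost model: one run costs
    setup_cycles + iterations * cycles_per_iteration.
    """
    if cycles_per_iteration <= 0:
        raise ValueError("cycles_per_iteration must be positive")
    budget = time_slice_cycles - setup_cycles
    per_run = budget // cycles_per_iteration
    if per_run < 1:
        raise ValueError("time slice too short for even one iteration")

    ranges: List[Tuple[int, int]] = []
    start = 0
    while start < total_iterations:
        end = min(start + per_run, total_iterations)
        ranges.append((start, end))   # this run is treated as uninterruptible
        start = end
    return ranges


if __name__ == "__main__":
    # 1000 iterations at 4 cycles each, 100 setup cycles, slice of 1700 cycles.
    print(split_iterations(1000, 4, 100, 1700))
```

A task switch can then take place between two runs, so no single run has to be interrupted.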

In an example embodiment of the present invention, the cache concept
may be based on a simple internal RAM (IRAM) cell structure allowing for an easy
scalability of the hardware. For instance, extending the XPP cache size
may require not much more than the duplication of IRAM cells.



In an embodiment of the present invention, a compiler for a RISC+XPP
system may provide for compilation for the RISC+XPP system of real world applications,
which are written in the C language. The compiler may remove the
necessity of developing NML (Native Mapping Language) code for the XPP by hand. It
may be possible, instead, to implement algorithms in the C language or to directly use
existing C applications without much adaptation to the XPP system. The compiler
may include the following three major components to perform the compilation
process for the XPP:

1. partitioning of the C source code into RISC and XPP parts;

2. transformations to optimize the code for the XPP; and

3. generating of NML code.

The generated NML code may be placed and routed for the XPP.

The partitioning component of the compiler may decide which parts of an application
code can be executed on the XPP and which parts are executed on the RISC. Typical
candidates for becoming XPP code may be loops with a large number of iterations whose
loop bodies are dominated by arithmetic operations. The remaining source code, including
the data transfer code, may be compiled for the RISC.
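The selection rule for XPP candidates (many iterations, loop body dominated by arithmetic) may be pictured with a small heuristic. This is a sketch under stated assumptions: the `Loop` record, the function `partition`, and both threshold values are hypothetical illustrations, and the specification does not state how the decision is actually weighted.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class Loop:
    """Hypothetical summary of one loop as seen by a partitioning step."""
    name: str
    iterations: int       # estimated trip count
    arithmetic_ops: int   # arithmetic operations in the loop body
    other_ops: int        # control, calls, and other non-arithmetic operations


def partition(
    loops: Sequence[Loop],
    min_iterations: int = 64,        # assumed threshold, for illustration only
    min_arith_ratio: float = 0.5,    # assumed threshold, for illustration only
) -> Tuple[List[str], List[str]]:
    """Return (xpp_candidates, risc_code) as lists of loop names."""
    xpp: List[str] = []
    risc: List[str] = []
    for loop in loops:
        total = loop.arithmetic_ops + loop.other_ops
        ratio = loop.arithmetic_ops / total if total else 0.0
        if loop.iterations >= min_iterations and ratio > min_arith_ratio:
            xpp.append(loop.name)    # long-running, arithmetic-dominated loop
        else:
            risc.append(loop.name)   # everything else stays on the RISC
    return xpp, risc


if __name__ == "__main__":
    loops = [
        Loop("fir_inner", iterations=4096, arithmetic_ops=6, other_ops=1),
        Loop("parse_header", iterations=8, arithmetic_ops=1, other_ops=9),
        Loop("state_machine", iterations=10000, arithmetic_ops=2, other_ops=12),
    ]
    print(partition(loops))
```

In this sketch a loop must satisfy both criteria; a long-running but control-heavy loop such as `state_machine` stays on the RISC side.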

The compiler may transform the XPP code such that it is optimized for
NML code generation. The transformations included in the compiler may include a
large number of loop transformations as well as general code transformations. Together with
data and code analysis the compiler may restructure the code so that it fits into
the XPP array and so that the final performance may exceed the pure RISC
performance. The compiler may generate NML code from the
transformed program. The whole compilation process may be controlled by an optimization
driver which selects the optimal order of transformations based on the source code.
Discussed below with respect to embodiments of the present invention are case studies.
The selection of the examples is guided by the principle that each example may stand
for a set of typical real-world applications. For each example, the work of the compiler
according to an embodiment of the present invention is demonstrated. For example, first
partitioning of the code is discussed. The code transformations, which may be done by the
compiler, are shown and explained. Some examples require minor source code
transformations which may be performed by hand. These transformations may be either too
expensive, or too specific to make sense to be included in the proposed compiler. Dataflow
graphs of the transformed codes are constructed for each example, which may be used by
the compiler to generate the NML code. In addition, the XPP resource usages are shown.
The case studies demonstrate that a compiler containing the proposed transformations can
generate efficient code from numerical applications for the XPP. This is possible because
the compiler may rely on the features of the suggested hardware, like the cache
controller.

Other embodiments of the present invention pertain to a realization that for data-streaming
data-processing, block-based coupling is highly preferable. This is in contrast to a word-based
coupling discussed above with respect to the prior art.

Further, embodiments of the present invention provide for the use of time use planning
control means, discussed above with respect to their use in the prior art, for configuring and
management of configurations for the purpose of scheduling of tasks, threads, and multi- and
hyper-threads.

BRIEF DESCRIPTION OF THE DRAWINGS

Fig. 1 illustrates components of a LEON architecture.

Fig. 2 shows the pipelined datapath structure of the LEON integer unit.

Fig. 3 illustrates components of a typical PAE.

Fig. 4 is a diagram that illustrates an extended datapath according to an example embodiment
of the present invention.

Fig. 5 illustrates transmission of data to an extended XPP-based datapath by passing the data
through an IO-Port, according to an example embodiment of the present invention.

Fig. 6 illustrates an extended LEON instruction pipeline, according to an example
embodiment of the present invention.

Fig. 7 is a graph that shows that the benefit brought by XPP rises with the number of IDCT
blocks computed by it before reconfiguration.

Fig. 8 is a block diagram of an MPEG-4 decoding algorithm, according to an example
embodiment of the present invention.

Fig. 9 is a block diagram illustrating components of an example embodiment of the present
invention, where an XPP core and a RISC core share a memory hierarchy.

Fig. 10 shows an IRAM and configuration cache controller data structures and a usage
example, according to an example embodiment of the present invention.

Fig. 11 shows an asynchronous pipeline of an XPP, according to an example embodiment of
the present invention.

Fig. 12 is a diagram that illustrates tasks of an XPP cache controller as states, according to an
example embodiment of the present invention.

Fig. 13 shows simultaneous multithreading according to an example embodiment of the
present invention.

Fig. 14 shows an example of a cache structure according to an example embodiment of the
present invention.
Fig. 15 is a control-flow graph of a piece of a program, according to an example embodiment
of the present invention.

Fig. 16 illustrates a code and diagram of an example of a true dependence with distance 0 on
array 'a', according to an example embodiment of the present invention.

Fig. 17 illustrates a code and diagram of an example of an anti-dependence with distance 0 on
array 'b', according to an example embodiment of the present invention.

Fig. 18 illustrates a code and diagram of an example of an output dependence with distance 0
on array 'a', according to an example embodiment of the present invention.

Fig. 19 illustrates a code and diagram of an example of a dependence with direction
vector (=,=) between S1 and S2 and a dependence with direction vector between S2
and S2, according to an example embodiment of the present invention.

Fig. 20 illustrates a code and diagram of an example of an anti-dependence with distance
vector (0,2), according to an example embodiment of the present invention.

Fig. 21 is a graph illustrating information of a flow-sensitive alias analysis versus a
flow-insensitive alias analysis, according to an example embodiment of the present invention.

Fig. 22 is a diagram that illustrates aligned and misaligned memory accesses.

Fig. 23 illustrates merging of arrays, according to an example embodiment of the present
invention.

Fig. 24 is a flowchart that illustrates a global view of a compiling procedure, according to an
example embodiment of the present invention.

Fig. 25 is a flowchart that illustrates a detailed architecture and an internal processing of an
XPP Compiler.
Fig. 26 is a diagram that illustrates details of XPP loop optimizations, including their
organization, according to an example embodiment of the present invention.

Fig. 27 is an expression tree of an edge 3x3 inner loop body, according to an example
embodiment of the present invention.

Fig. 28 is an expression tree showing the interchanging of operands of commutative add
expressions to reduce an overall tree depth, according to an example embodiment of the
present invention.

Fig. 29 shows a main calculation network of an edge 3x3 configuration, according to an
example embodiment of the present invention.

Fig. 30 shows a case of synthesized shift registers, according to an example embodiment of
the present invention.

Fig. 31 is a data dependency graph relating to a FIR filter, according to an example
embodiment of the present invention.

Fig. 32 is a dataflow graph that is achieved in an instance where values of x needed for
computation of y are kept in registers, according to an example embodiment of the present
invention.

Fig. 33 is a dataflow graph representing an inner loop with loop unrolling, according to an
example embodiment of the present invention.

Fig. 34 is a data dependency graph for matrix multiplication, according to an example
embodiment of the present invention.

Fig. 35 is a visualization of array access sequences prior to optimization according to an
example embodiment of the present invention.

Fig. 36 is a visualization of array access sequences subsequent to optimization according to
an example embodiment of the present invention.
Fig. 37 is a dataflow graph of a synthesized configuration and shows matrix multiplication
after unroll and jam, according to an example embodiment of the present invention.

Fig. 38 is a data flow graph corresponding to a butterfly loop, according to an example
embodiment of the present invention.

Fig. 39 is a data flow graph showing modifications to code corresponding to the graph of Fig.
38, according to an example embodiment of the present invention.

Fig. 40 illustrates a splitting network, according to an example embodiment of the present
invention.

Fig. 41 is a diagram that illustrates how short values are handled, according to an example
embodiment of the present invention.

Fig. 42 is a diagram that illustrates how a merge is done, according to an example
embodiment of the present invention.

Fig. 43 illustrates a changing of values of a block row by row before processing of columns.

Fig. 44 illustrates a possible implementation for saturate(val,n) as an NML schematic using
two ALUs, according to an example embodiment of the present invention.

Fig. 45 is a data flow graph for IDCTCOLUMN_CONFIG.

Fig. 46 is a diagram that illustrates use of two counter macros for address generation,
according to an example embodiment of the present invention.

Fig. 47 is a diagram that illustrates an idling of units of a deep pipeline.

Fig. 48 illustrates a loop interchange, according to an example embodiment of the present
invention.
Fig. 49 illustrates use of output IRAM of Config A as input IRAM of Config B to bypass a
memory interface for bandwidth optimization, according to an example embodiment of the
present invention.

Fig. 50 illustrates block offsets inside tiles generated by a SUIP counter, according to an
example embodiment of the present invention.

Fig. 51 illustrates a difference in efficiency between an instance where there is no data
duplication and an instance where there is data duplication, according to an example
embodiment of the present invention.

Fig. 52 illustrates IDCTROW_CONFIG, IDCTCOLUMN_CONFIG, and
REORDER_CONFIG of an example embodiment of the present invention.

Fig. 53 is a dataflow graph of loop bodies of wavelet after performance of a step of tree
balancing, according to an example embodiment of the present invention.

Fig. 54 is a graphical representation of functions for processing data and event packets that
can be configured into an RDFP.

Figs. 55-69 each illustrate a CDFG according to a respective embodiment of the present
invention.

DETAILED DESCRIPTION OF THE INVENTION

Environment

Instruction datapaths of modern microprocessors are constrained by certain limitations
because they use static instruction sets driven by the traditional von Neumann or Harvard
architectural principles. These limitations may be avoided via a dynamic reconfigurable
processor datapath extension achieved by integrating traditional static datapaths with the
coarse-grain dynamic reconfigurable XPP architecture. Therefore, a loosely asynchronous
coupling mechanism of the corresponding instruction datapath or datapath units has been
developed and integrated onto a CMOS 0.13 µm standard cell technology from UMC. In
embodiments of the present invention, the SPARC compatible LEON RISC processor may be
used, with its static pipelined instruction datapath extended to be configured and personalized
for specific applications. This compiler-compatible instruction set extension allows various
and efficient uses, e.g., in streaming application domains like MPEG-4, digital filters, mobile
communication modulation, etc. Discussed below is a coupling technique by flexible
dual-clock FIFO interfaces that allows asynchronous concurrency of the additionally configured
compound instructions, which are integrated into the programming and compilation
environment of the LEON processor, and that allows adaption of the frequency of the
configured XPP datapath, dependent on actual performance requirements, e.g., for avoiding
unneeded cycles and reducing power consumption.

The coupling technique of embodiments of the present invention discussed below combines
the flexibility of a general purpose microprocessor with the performance and power
consumption of coarse-grain reconfigurable datapath structures, nearly comparable to ASIC
performance. Two programming and computing paradigms (control-driven von Neumann
and transport-triggered XPP) are unified within one hybrid architecture with the option of
two clock domains. The ability to reconfigure the transport-triggered XPP makes the system
independent from standards or specific applications. This concept creates potential to
develop multi-standard communication devices like software radios by using one extended
processor architecture with adapted programming and compilation tools. Thus, new
standards can be easily implemented through software updates. The system is scalable during
design time through the scalable array-structure of the used XPP extension. This extends the
range of suitable applications from products with fewer multimedia functions to complex high
performance systems.

LEON RISC Microprocessor

Embodiments of the present invention may be implemented using a 32-bit SPARC V8
compatible LEON microprocessor. See SPARC International Inc., The SPARC Architecture
Manual, Version 8, at http://www.sparc.com; Jiri Gaisler, The LEON Processor User's
Manual, at http://www.gaisler.com. This microprocessor is a synthesizable, freely available
VHDL model which has a load/store architecture and has a five-stage pipeline
implementation with separated instruction and data caches.

Fig. 1 illustrates components of a LEON architecture. The LEON may be provided with a
full implementation of an AMBA 2.0 AHB and APB on-chip bus (1000, 1002), a hardware
multiplier and divider, programmable 8/16/32-bit memory controller 1005 for external
PROM, static RAM and SDRAM, and several on-chip peripherals such as timers 1010,
UARTs 1012, an interrupt controller 1014, and a 16-bit I/O port 1016. A simple power down
mode may be implemented as well.

LEON is developed by the European Space Agency (ESA) for future space missions. The
performance of LEON is close to that of an ARM9 series processor, but LEON does not have
a memory management unit (MMU) implementation, which limits the use to single memory
space applications. Fig. 2 shows the pipelined datapath structure of the LEON integer unit.

eXtreme Processing Platform - XPP

Embodiments of the present invention may be implemented using the XPP architecture.
Regarding the XPP architecture, see http://www.pactcorp.com; "The XPP Communication
System," supra; and V. Baumgarte et al., "A Self-Reconfigurable Data Processing
Architecture," The 1st Intl. Conference on Engineering of Reconfigurable Systems and
Algorithms (ERSA '01), Las Vegas, NV (June 2001). The XPP architecture is based on a
hierarchical array of coarse-grain, adaptive computing elements called Processing Array
Elements (PAEs) and a packet-oriented communication network. The strength of the XPP
technology originates from the combination of array processing with unique, powerful
run-time reconfiguration mechanisms. Since configuration control is distributed over a
Configuration Manager (CM) embedded in the array, PAEs can be configured rapidly in
parallel while neighboring PAEs are processing data. Entire applications can be configured
and run independently on different parts of the array. Reconfiguration may be triggered
externally or even by special event signals originating within the array, enabling
self-reconfiguring designs. By utilizing protocols implemented in hardware, data and event
packets may be used to process, generate, decompose and merge streams of data.

The XPP has some similarities with other coarse-grain reconfigurable architectures like the
KressArray (see R. Hartenstein et al., supra) or Raw Machines (see E. Waingold et al.,
supra), which are specifically designed for stream-based applications. XPP's main
distinguishing features are its automatic packet-handling mechanisms and its sophisticated
hierarchical configuration protocols for runtime and self reconfiguration.



Array Structure

A CM may include a state machine and internal RAM for configuration caching. The PAE
itself (see top right-hand side of Fig. 3) may include a configuration bus which connects the
CM with PAEs and other configurable objects. Horizontal busses may carry data and events.
They can be segmented by configurable switch-objects, and can be connected to PAEs and
special I/O objects at the periphery of the device.

A PAE is a collection of PAE objects. Fig. 3 illustrates components of a typical PAE, which
may include a BREG object (back registers) 1100 and an FREG object (forward registers)
1102, which are used for vertical routing, as well as an ALU object 1104 which performs the
actual computations. The ALU 1104 may perform common fixed-point arithmetical and
logical operations as well as several special three-input opcodes, such as multiply-add, sort,
and counters. Events generated by ALU objects depend on ALU results or exceptions, very
similar to the state flags of a conventional microprocessor. A counter, e.g., generates a
special event only after it has terminated. How these events are used is discussed below.

Another PAE object implemented in the XPP is a memory object which can be used in FIFO
mode or as RAM for lookup tables, intermediate results, etc. However, any PAE object
functionality can be included in the XPP architecture.

Packet Handling and Synchronization 

PAE objects, as defined above, may communicate via a packet-oriented network. Two types
of packets may be sent through the array: data packets and event packets. Data packets have
a uniform bit width specific to the device type. In normal operation mode, PAE objects are
self-synchronizing. An operation is performed as soon as all necessary data input packets are
available. The results are forwarded as soon as they are available, provided the previous
results have been used. Thus, it is possible to map a signal-flow graph directly to ALU
objects. Event packets are one bit wide. They transmit state information which controls
ALU execution and packet generation.
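
The self-synchronizing behavior described above can be sketched in software. The following
is a minimal Python model, not the XPP hardware protocol: the class name PaeObject, its
one-slot input and output registers, and its methods are assumptions made for illustration
only. An object fires once every input holds a packet and its previous result has been
consumed.

```python
from typing import Callable, List, Optional


class PaeObject:
    """Toy model of a self-synchronizing PAE object (hypothetical names)."""

    def __init__(self, op: Callable[..., int], n_inputs: int) -> None:
        self.op = op
        self.inputs: List[Optional[int]] = [None] * n_inputs
        self.output: Optional[int] = None  # one-slot result register

    def offer(self, port: int, value: int) -> bool:
        """Try to place a data packet on an input; refuse if still occupied."""
        if self.inputs[port] is not None:
            return False
        self.inputs[port] = value
        return True

    def step(self) -> bool:
        """Fire only if all inputs are present and the old result was used."""
        if self.output is not None or any(v is None for v in self.inputs):
            return False
        self.output = self.op(*self.inputs)
        self.inputs = [None] * len(self.inputs)
        return True

    def take(self) -> Optional[int]:
        """Consume the result, freeing the object to fire again."""
        value, self.output = self.output, None
        return value


adder = PaeObject(lambda a, b: a + b, n_inputs=2)
adder.offer(0, 3)
fired_early = adder.step()   # only one operand present, so no firing
adder.offer(1, 4)
fired = adder.step()         # both operands present, so the object fires
result = adder.take()        # consuming the result frees the output slot
```

In this sketch, back-pressure arises naturally: a second call to step() returns False until
take() has removed the earlier result, which mirrors the condition that results are forwarded
only when the previous results have been used.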



Configuration 



Every PAE stores locally its current configuration state, i.e., if it is part of a configuration or
not (states "configured" or "free"). Once a PAE is configured, it changes its state to
"configured." This prevents the CM from reconfiguring a PAE which is still used by another
application. The CM caches the configuration data in its internal RAM until the required
PAEs become available.

While loading a configuration, all PAEs start to compute their part of the application as soon
as they are in state "configured." Partially configured applications are able to process data
without loss of packets. This concurrency of configuration and computation hides
configuration latency.

XPP Application Mapping 



The NML language, a PACT proprietary structural language with reconfiguration primitives,
was developed by PACT to map applications to the XPP array. It gives the programmer
direct access to all hardware features.

In NML, configurations consist of modules which are specified as in a structural hardware
description language, similar to, for example, structural VHDL. PAE objects are explicitly
allocated, optionally placed, and their connections specified. Hierarchical modules allow
component reuse, especially for repetitive layouts. Additionally, NML includes statements to
support configuration handling. A complete NML application program may include one or
more modules, a sequence of initially configured modules, differential changes, and
statements which map event signals to configuration and prefetch requests. Thus,
configuration handling is an explicit part of the application program.

XPP-based architectures and development tools, such as the PACT XPP Development Suite
(XDS), are discussed in detail at http://www.pactcorp.com.

LEON Instruction Datapath Extension 

LEON and XPP should be able to communicate with each other in a simple and high
performance manner. While the XPP is a dataflow oriented device, the LEON is a general
purpose processor, suitable for handling control flow. See, for example, The SPARC
Architecture Manual, supra; Jiri Gaisler, supra. Therefore, LEON may be used for system
control. To do this, the XPP is integrated into the datapath of the LEON integer unit, which
is able to control the XPP. Fig. 4 is a diagram that illustrates this extended datapath.

Due to unpredictable operation time of the XPP algorithm, integration of XPP into the LEON
datapath is done in a loosely-coupled way. Thus, the XPP array can operate independently
of the LEON, which is able to control and reconfigure the XPP during runtime. Since the
configuration of XPP is handled by LEON, the CM 1106 of the XPP is unnecessary and can
be left out of the XPP array. The configuration codes are stored in the LEON RAM. LEON
transfers the needed configuration from its system RAM into the XPP and creates the needed
algorithm on the array.

To enable a maximum of independence of XPP from LEON, all ports of the XPP - input ports
as well as output ports - are buffered using dual clock FIFOs. Dual-clocked FIFOs are
implemented into the IO-Ports between LEON and XPP. To transmit data to the extended
XPP-based datapath, the data are passed through an IO-Port as shown in Fig. 5. In addition
to the FIFO, the IO-Ports include logic to generate handshake signals and an interrupt request
signal. The IO-Port for receiving data from XPP is similar to Fig. 5 except with a reversed
direction of the data signals. This enables XPP to perform completely independently of
LEON as long as there are input data available in the input port FIFOs and free space for
result data in the output port FIFOs. There are a number of additional features implemented
in the LEON pipeline to control the data transfer between LEON and XPP.

When LEON tries to write to an IO-Port containing a full FIFO or read from an IO-Port
containing an empty FIFO, a trap is generated. This trap can be handled through a trap
handler. A further mechanism, e.g., pipeline-holding, may be implemented to allow LEON
to hold the pipeline and wait for free FIFO space during XPP write access or wait for a valid
FIFO value during XPP read access. When using pipeline-holding, the software developer
has to avoid reading from an IO-Port with an empty FIFO while the XPP, or the XPP input
IO-Ports, includes no data to produce output. In this case a deadlock will occur, requiring a
reset of the complete system.
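
The two access policies can be contrasted in a small software model. The Python sketch
below makes several assumptions that are not in the description: IoPort, FifoTrap, the hold
flag, and the producer callback are hypothetical names; a pipeline hold is modeled as a
counter of stalled cycles; and a bound on that counter stands in for the deadlock case. Only
the read side is modeled in hold mode; a write to a full FIFO always traps here for brevity.

```python
from collections import deque
from typing import Callable, Optional


class FifoTrap(Exception):
    """Stands in for the trap raised on a full-FIFO write or empty-FIFO read."""


class IoPort:
    """Toy model of one IO-Port FIFO as seen from the LEON side."""

    def __init__(self, depth: int, hold: bool = False) -> None:
        self.fifo: deque = deque()
        self.depth = depth
        self.hold = hold          # True: pipeline-hold mode, False: trap mode
        self.stalled_cycles = 0

    def valid_values(self) -> int:
        """Number of valid values currently in the FIFO (status information)."""
        return len(self.fifo)

    def write(self, value: int) -> None:
        if len(self.fifo) >= self.depth:
            raise FifoTrap("write to full FIFO")
        self.fifo.append(value)

    def read(self, producer: Optional[Callable[["IoPort"], None]] = None,
             max_stall: int = 8) -> int:
        """Read one value; `producer` models the XPP filling the FIFO."""
        while not self.fifo:
            if not self.hold:
                raise FifoTrap("read from empty FIFO")
            if self.stalled_cycles >= max_stall:
                # Nothing will ever arrive: the deadlock the text warns about.
                raise RuntimeError("deadlock: held on an empty FIFO")
            self.stalled_cycles += 1
            if producer is not None:
                producer(self)
        return self.fifo.popleft()


def late_producer(port: IoPort) -> None:
    """Hypothetical XPP side that delivers a result on the third stalled cycle."""
    if port.stalled_cycles == 3:
        port.fifo.append(42)


trap_port = IoPort(depth=2)
try:
    trap_port.read()
    trapped = False
except FifoTrap:
    trapped = True

hold_port = IoPort(depth=2, hold=True)
value = hold_port.read(producer=late_producer)
```

In trap mode the empty read is reported immediately and software decides what to do next.
In hold mode the read simply waits, which is simpler for the programmer but fails if no
producer ever delivers data.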

XPP can generate interrupts for the LEON when trying to read a value from an empty FIFO
port or to write a value to a full FIFO port. The occurrence of interrupts indicates that the
XPP array cannot process the next step because it has either no input values or it cannot
output the result value. The interrupts generated by the XPP are maskable.

The interface provides information about the FIFOs. LEON can read the number of valid
values that are in the FIFO.

Fig. 6 illustrates an extended LEON instruction pipeline. The interface, shown in Fig. 6, to
the XPP appears to the LEON as a set of special registers. These XPP registers can be
divided into a communication register category and a status register category.

For data exchange, the XPP communication registers are used. Since XPP provides three
different types of communication ports, there are also three types of communication registers.
Each type is split into an input part and an output part.

Communication Registers

The data for the process are accessed through XPP data registers. The number of data input
and data output ports, as well as the data bit-width, depends on the implemented XPP array.

XPP can generate and consume events. Events are one bit signals. The number of input
events and output events also depends on the implemented XPP array.

Configuration of the XPP is done through the XPP configuration register. LEON reads the
required configuration value from a file stored in its system RAM and writes it to the XPP
configuration register.

Status Registers

There are a number of XPP status registers implemented to control the behavior and get
status information of the interface. Switching between the usage of trap handling and
pipeline holding can be done in the hold register. An XPP clock register includes a clock
frequency ratio between LEON and XPP. By writing to this register, LEON software can set
the XPP clock relative to the LEON clock. This allows adaptation of the XPP clock
frequency to the required XPP performance and consequently allows for influencing the
power consumption of the system. Writing zero to the XPP clock register turns off the XPP.
There is also a status register for every FIFO including the number of valid values actually
available in the FIFO.



These status registers provide a high degree of flexibility in communication between LEON
and XPP and enable different communication modes.



Modes

If there is only one application running on the system at a particular time, software may be
developed in pipeline-hold mode. In this instance, LEON initiates data read or write from or
to XPP. If there is no value to read or no free space for a value to be written, the LEON
pipeline will be stopped until read or write is possible. This can be used to reduce power
consumption of the LEON part.

In interrupt mode, XPP can influence the LEON program flow. Thus, the IO-Ports generate
an interrupt depending on the actual number of values available in the FIFO. The
communication between LEON and XPP is via interrupt service routines.

Polling mode is a way to access the XPP. Initiated by a timer event, LEON reads XPP ports
that contain data and writes to XPP ports that have free FIFO space. Between these phases,
LEON can perform other calculations.

It is possible to switch between these strategies anytime within one application.

A conventional XPP includes a configuration manager to handle configuration and
reconfiguration of the array. However, in combination with the LEON, the configuration
manager is dispensable because the configuration as well as any reconfiguration is controlled
by the LEON through the XPP configuration register. All XPP configurations used for an
application are stored in the LEON's system RAM.

Tool and Compiler Integration 

To make the new XPP registers accessible through software, the LEON's SPARC V8
instruction set (see The SPARC Architecture Manual, supra) is extended by a new subset of
instructions. These instructions are based on the SPARC instruction format, but do not
conform to the SPARC V8 standard. Corresponding to the SPARC conventions of a
load/store architecture, the instruction subset can be divided into two categories. Load/store
instructions can exchange data between the LEON memory and the XPP communication
registers. The number of cycles per instruction is similar to the standard load/store
instructions of the LEON. Read/write instructions are used for communications between
LEON registers. Since the LEON register-set is extended by the XPP registers, the read/write
instructions are also extended to access XPP registers. Status registers can only be accessed
with read/write instructions. Execution of arithmetic instructions on XPP registers is not
possible. Values have to be written to standard LEON registers before they can be targets of
arithmetic operations.



The complete system can still execute any SPARC V8 compatible code. In that case, the XPP
remains completely unused.

The LEON is provided with the LECCS cross compiler system (see LEON/ERC32 Cross
Compilation System (LECCS) at
http://www.gaisler.com/cms4_5_3/index.php?option=com_content&task=view&id=62&Itemid=149)
under the terms of LGPL. This system includes modified versions of the binutils
2.11 and gcc 2.95.2. To make the new instruction subset available to software developers,
the assembler of the binutils has been extended by a number of instructions according to the
implemented instruction subset. The new instructions have the same mnemonic as the
regular SPARC V8 load, store, read, and write instructions. Only the new XPP registers have
to be used as a source or target operand. Since the modifications of LECCS are
straightforward extensions, the cross compiler system is backward compatible to the original
version. The availability of the source code of LECCS has allowed for extending the tools by
the new XPP operations in the described way.

The development of the XPP algorithms has to be done with separate tools, provided by
PACT Corp.

Application Results 

As a first analysis application, an inverse Discrete Cosine Transform (iDCT) applied to an 8x8
pixel block was implemented. For all simulations, a 250 MHz clock frequency for the
LEON processor and a 50 MHz clock frequency for XPP were used. The usage of XPP
accelerates the computation of the iDCT by about a factor of four, depending on the
communication mode. However, XPP has to be configured before computing the iDCT on it.
The following table shows the configuration time for this algorithm.

                        LEON alone      LEON with XPP    LEON with XPP    LEON with XPP
                                        in IRQ Mode      in Poll Mode     in Hold Mode

Configuration of XPP    -               71,308 ns        84,364 ns        77,976 ns
                                        17,827 cycles    21,091 cycles    19,494 cycles

2D iDCT (8x8)           14,672 ns       3,272 ns         3,872 ns         3,568 ns
                        3,668 cycles    818 cycles       968 cycles       892 cycles






As shown in Fig. 7, the benefit brought by XPP rises with the number of iDCT blocks
computed by it before reconfiguration. Accordingly, the number of reconfigurations during
complex algorithms should be minimized.
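
The trade-off can be made concrete with the figures from the table above, reading the
tabulated times as whole nanoseconds (for example 71,308 ns for configuration in IRQ mode).
The following Python sketch is an illustration only: it assumes that configuration happens
once, that each 8x8 block then costs the per-block time listed for the chosen mode, and that
no other overheads apply. The function names are hypothetical.

```python
# Times in nanoseconds, taken from the table (thousands separators removed).
LEON_ONLY_IDCT_NS = 14672
MODES = {
    "IRQ":  {"config_ns": 71308, "idct_ns": 3272},
    "Poll": {"config_ns": 84364, "idct_ns": 3872},
    "Hold": {"config_ns": 77976, "idct_ns": 3568},
}


def break_even_blocks(config_ns: int, idct_ns: int,
                      leon_ns: int = LEON_ONLY_IDCT_NS) -> int:
    """Smallest block count n with config_ns + n * idct_ns < n * leon_ns."""
    saving_per_block = leon_ns - idct_ns
    if saving_per_block <= 0:
        raise ValueError("XPP path is not faster per block")
    return config_ns // saving_per_block + 1


def speedup(n_blocks: int, config_ns: int, idct_ns: int,
            leon_ns: int = LEON_ONLY_IDCT_NS) -> float:
    """Overall speedup for n_blocks computed after a single configuration."""
    return (n_blocks * leon_ns) / (config_ns + n_blocks * idct_ns)


for name, times in MODES.items():
    n = break_even_blocks(**times)
    print(f"{name}: XPP pays off from {n} blocks per configuration")
```

Under these assumptions, one configuration pays off from 7 blocks in IRQ mode and from 8
blocks in Poll and Hold mode. For large block counts the overall speedup approaches the
per-block ratio, which is the factor of about four mentioned above.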

A first complex application implemented on the system is MPEG-4 decoding. The
optimization of the algorithm partitioning on LEON and XPP is still in progress. Fig. 8 is a
block diagram of the MPEG-4 decoding algorithm. Frames with 320 x 240 pixels were
decoded. LEON, by using SPARC V8 standard instructions, decodes one frame in 23.46
seconds. In a first implementation of MPEG-4 using the XPP, only the iDCT is computed by
XPP. The rest of the MPEG-4 decoding is still done with LEON. With the help of XPP, one
frame is decoded in 17.98 s. This is a performance boost of more than twenty percent. Since
the XPP performance gain by accelerating the iDCT algorithm only is very low at the
moment, we work on XPP implementations of Huffman decoding, dequantization, and
prediction decoding. So the performance boost of this implementation against the standalone
LEON will be increased.



Hardware



Design Parameter Changes

For integration of the XPP core as a functional unit into a standard RISC core, some system
parameters may be reconsidered as follows:



Pipelining / Concurrency / Synchronicity

RISC instructions of totally different type (Ld/St, ALU, Mul/Div/MAC, FPALU, FPMul,
etc.) may be executed in separate specialized functional units to increase the fraction of
silicon that is busy on average. Such functional unit separation has led to superscalar RISC
designs that exploit higher levels of parallelism.

Each functional unit of a RISC core may be highly pipelined to improve throughput.
Pipelining may overlap the execution of several instructions by splitting them into
unrelated phases, which may be executed in different stages of the pipeline. Thus,
different stages of consecutive instructions can be executed in parallel with each stage taking
much less time to execute. This may allow higher core frequencies.

With an approximate subdivision of the pipelines of all functional units into
sub-operations of the same size (execution time), these functional units / pipelines may
execute in a highly synchronous manner, with complex floating point pipelines being the
exception.

Since the XPP core uses data flow computation, it is pipelined by design. However,
a single configuration usually implements a loop of the application, so the configuration
remains active for many cycles, unlike the instructions in every other functional unit, which
typically execute for one or two cycles at most. Therefore, it is still worthwhile to consider
the separation of several phases (e.g., Ld / Ex / Store) of an XPP configuration (i.e.,
an XPP instruction) into several functional units to improve concurrency via pipelining on
this coarser scale. This also may improve throughput and response time in
conjunction with multi tasking operations and implementations of simultaneous
multithreading (SMT).

The multi cycle execution time may also forbid a strongly synchronous execution
scheme and may rather lead to an asynchronous scheme, e.g., like for floating point
square root units. This in turn may necessitate the existence of explicit
synchronization instructions.

Core Frequency / Memory Hierarchy

As a functional unit, the XPP's operating frequency may either be half of the core
frequency or equal to the core frequency of the RISC. Almost every RISC core currently on
the market exceeds its memory bus frequency with its core frequency by a large factor.
Therefore, caches are employed, forming what is commonly called the memory hierarchy,
where each layer of cache is larger but slower than its predecessors.

This memory hierarchy does not help to speed up computations which shuffle large amounts
of data, with little or no data reuse. These computations are called "bounded by memory
bandwidth." However, other types of computations with more data locality (another
term for data reuse) may gain performance as long as they fit into one of the upper
layers of the memory hierarchy. This is the class of applications that gains the highest
speedups when a memory hierarchy is introduced.

Classical vectorization can be used to transform memory-bounded algorithms, with a data set
too big to fit into the upper layers of the memory hierarchy. Rewriting the code to reuse
smaller data sets sooner exposes memory reuse on a smaller scale. As the new data set size is
chosen to fit into the caches of the memory hierarchy, the algorithm is not memory bounded
anymore, yielding significant speed-ups.
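
As an illustration of rewriting code to reuse smaller data sets sooner, the following Python
sketch compares a plain matrix multiplication with a blocked (tiled) version. Matrix
multiplication is a stand-in example chosen here and is not taken from the description; the
tile size and function names are assumptions. In pure Python the blocked loop order only
shows the access pattern and is not expected to run faster.

```python
from typing import List

Matrix = List[List[int]]


def matmul_naive(a: Matrix, b: Matrix) -> Matrix:
    """Reference product; each row of a sweeps over all of b."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            for k in range(m):
                c[i][j] += a[i][k] * b[k][j]
    return c


def matmul_tiled(a: Matrix, b: Matrix, tile: int = 2) -> Matrix:
    """Same product, computed tile by tile so that each small block is
    reused while it would still reside in an upper layer of the hierarchy."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0] * p for _ in range(n)]
    for i0 in range(0, n, tile):
        for j0 in range(0, p, tile):
            for k0 in range(0, m, tile):
                # Work on one tile of a, b, and c at a time.
                for i in range(i0, min(i0 + tile, n)):
                    for j in range(j0, min(j0 + tile, p)):
                        acc = c[i][j]
                        for k in range(k0, min(k0 + tile, m)):
                            acc += a[i][k] * b[k][j]
                        c[i][j] = acc
    return c


a = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
b = [[9, 8], [7, 6], [5, 4]]
same = matmul_tiled(a, b) == matmul_naive(a, b)
```

The tile size plays the role of the new data set size in the paragraph above: it would be
chosen so that the tiles in use fit into the cache level being targeted.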

Software / Multitasking Operating Systems

As the XPP is introduced into a RISC core, the changed environment - higher frequency and
the memory hierarchy - may necessitate not only reconsideration of hardware design
parameters, but also a reevaluation of the software environment.

Memory Hierarchy 

The introduction of a memory hierarchy may enhance the set of applications that
can be implemented efficiently. So far, the XPP has mostly been used for algorithms that 
read their data sets in a linear manner, applying some calculations in a pipelined fashion and 
writing the data back to memory. As long as all of the computation fits into the XPP array, 
these algorithms are memory bounded. Typical applications are filtering and audio signal 
processing in general. 

But there is another set of algorithms that have even higher computational complexity and
higher memory bandwidth requirements. Examples are picture and video processing, where
a second and third dimension of data coherence opens up. This coherence is, e.g., exploited
by picture and video compression algorithms that scan pictures in both dimensions to find
similarities, even searching consecutive pictures of a video stream for analogies.
These algorithms have a much higher algorithmic complexity as well as higher
memory requirements. Yet they are data local, either by design or by transformation,
thus efficiently exploiting the memory hierarchy and the higher clock
frequencies of processors with memory hierarchies.

Multi Tasking 

The introduction of the XPP into a standard RISC core makes it necessary to understand and
support the needs of a multitasking operating system, as standard RISC processors are usually
operated in multitasking environments. With multitasking, the operating system may switch
the executed application on a regular basis, thus simulating concurrent execution of several
applications (tasks). To switch tasks, the operating system may have to save the state
(e.g., the contents of all registers) of the running task and then reload the state of another
task. Hence, it may be necessary to determine what the state of the processor is, and to
keep it as small as possible to allow efficient context switches.

Modern microprocessors gain their performance from multiple specialized and deeply
pipelined functional units and high memory hierarchies, enabling high core frequencies. But
high memory hierarchies mean that there is a high penalty for cache misses due to the
difference between core and memory frequency. Many core cycles may pass until the values
are finally available from memory. Deep pipelines incur pipeline stalls due to data
dependencies as well as branch penalties for mispredicted conditional branches. Specialized
functional units like floating point units idle for integer-only programs. For these reasons,
average functional unit utilization is much too low.

The newest development with RISC processors, Simultaneous MultiThreading (SMT), adds
hardware support for a finer granularity (instruction / functional unit level) switching of tasks,
exposing more than one independent instruction stream to be executed. Thus, whenever one
instruction stream stalls or doesn't utilize all functional units, the other one can jump in. This
improves functional unit utilization for today's processors.

With SMT, the task (process) switching is done in hardware, so the processor state has to be
duplicated in hardware. So again it is most efficient to keep the state as small as possible.
For the combination of the PACT XPP and a standard RISC processor, SMT may be very
beneficial, since the XPP configurations may execute longer than the average RISC
instruction. Thus, another task can utilize the other functional units, while a configuration is
running. On the other hand, not every task will utilize the XPP, so while one such
non-XPP task is running, another one will be able to use the XPP core.

Communication Between the RISC Core and the XPP Core

The following are several possible embodiments that are each a possible hardware
implementation for accessing memory.

Streaming

Since streaming can only support (number_of_IO_ports * width_of_IO_port) bits per cycle, it
may be well suited for only small XPP arrays with heavily pipelined configurations
that feature few inputs and outputs. As the pipelines take a long time to fill and empty while
the running time of a configuration is limited (as described herein with respect to
"context switches"), this type of communication does not scale well to bigger XPP arrays and
XPP frequencies near the RISC core frequency.
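
The bandwidth bound stated above can be turned into a small calculation. The Python sketch
below is illustrative only: the port count, port width, and clock value in the usage lines are
made-up numbers, not XPP specifications, and the function names are hypothetical.

```python
def streaming_bits_per_cycle(number_of_io_ports: int, width_of_io_port: int) -> int:
    """Upper bound on streamed bits per cycle, as stated in the text."""
    return number_of_io_ports * width_of_io_port


def streaming_bytes_per_second(number_of_io_ports: int, width_of_io_port: int,
                               clock_hz: float) -> float:
    """Peak streaming bandwidth if every port transfers on every cycle."""
    bits = streaming_bits_per_cycle(number_of_io_ports, width_of_io_port)
    return bits * clock_hz / 8


def cycles_to_stream(n_words: int, number_of_io_ports: int) -> int:
    """Minimum cycles to move n_words at one word per port per cycle."""
    return -(-n_words // number_of_io_ports)  # ceiling division


# Hypothetical example: 4 ports of 32 bits at 100 MHz.
bits = streaming_bits_per_cycle(4, 32)
peak = streaming_bytes_per_second(4, 32, 100e6)
cycles = cycles_to_stream(1000, 4)
```

Because the bound grows only with the number and width of the ports, a larger array with
the same ports gets no additional streaming bandwidth, which is the scaling limit noted
above.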

• Streaming from the RISC core

In this setup, the RISC may supply the XPP array with the streaming data. Since the
RISC core may have to execute several instructions to compute addresses and load an
item from memory, this setup is only suited if the XPP core is reading data with a frequency
much lower than the RISC core frequency.

• Streaming via DMA

In this mode the RISC core only initializes a DMA channel which may then supply
the data items to the streaming port of the XPP core.

Shared Memory (Main Memory)

In this configuration, the XPP array configuration may use a number of PAEs to generate
an address that is used to access main memory through the IO ports. As the number of IO
ports may be very limited, this approach may suffer from the same limitations as
the previous one, although for larger XPP arrays there is less impact of using PAEs for
address generation. However, this approach may still be useful for loading values from
very sparse vectors.



Shared Memory (IRAM)

This data access mechanism uses the IRAM elements to store data for local computations.
The IRAMs can either be viewed as vector registers or as local copies of main memory.

The following are several ways in which to fill the IRAMs with data:

1. The IRAMs may be loaded in advance by a separate configuration using
streaming.

This method can be implemented with the current XPP architecture. The IRAMs
act as vector registers. As explicated above, this may limit the performance of
the XPP array, especially as the IRAMs will always be part of the externally
visible state and hence must be saved and restored on context switches.

2. The IRAMs may be loaded in advance by separate load-instructions.

This is similar to the first method. Load-instructions may be implemented in
hardware which loads the data into the IRAMs. The load-instructions can be
viewed as a hard coded load configuration. Therefore, configuration reloads
may be reduced. Additionally, the special load-instructions may use a wider
interface to the memory hierarchy. Therefore, a more efficient method than
streaming can be used.

3. The IRAMs can be loaded by a "burst preload from memory" instruction of the
cache controller. No configuration or load-instruction is needed on the XPP. The
IRAM load may be implemented in the cache controller and triggered by the
RISC processor. But the IRAMs may still act as vector registers and may
therefore be included in the externally visible state.

4. The best mode, however, may be a combination of the previous solutions with
the extension of a cache:

A preload instruction may map a specific memory area defined by starting
address and size to an IRAM. This may trigger a (delayed, low priority)
burst load from the memory hierarchy (cache). After all IRAMs are mapped, the
next configuration can be activated. The activation may incur a wait until
all burst loads are completed. However, if the preload instructions are issued long
enough in advance and no interrupt or task switch destroys cache locality, the wait
will not consume any time.

To specify a memory block as output-only IRAM, a "preload clean" instruction
may be used, which may avoid loading data from memory. The "preload
clean" instruction just indicates the IRAM for write back.

A synchronization instruction may be needed to make sure that the content of a
specific memory area, which is cached in IRAM, is written back to the memory 
hierarchy. This can be done globally (full write back), or selectively by 
specifying the memory area, which will be accessed. 

State of the XPP Core

As discussed above, the size of the state may be crucial for
the efficiency of context switches. However, although the size of the state may be fixed for
the XPP core, whether or not they have to be saved may depend on the declaration
of the various state elements.

The state of the XPP core can be classified as: 
1. Read only (instruction data)

• configuration data, consisting of PAE configuration and routing configuration
data; and

2. Read - Write

• the contents of the data registers and latches of the PAEs, which are driven
onto the busses

• the contents of the IRAM elements.

Limiting Memory Traffic

There are several possibilities to limit the amount of memory traffic during context switches,
as follows:

Do Not Save Read-Only Data

This may avoid storing configuration data, since configuration data is read only. The
current configuration may be simply overwritten by the new one.



Save Less Data

If a configuration is defined to be uninterruptible (non pre-emptive), all of the local state on
the busses and in the PAEs can be declared as scratch. This means that every configuration
may get its input data from the IRAMs and may write its output data to the IRAMs.
So after the configuration has finished, all information in the PAEs and on the buses may be
redundant or invalid and saving of the information might not be required.

Save Modified Data Only

To reduce the amount of R/W data, which has to be saved, the method may keep
track of the modification state of the different entities. This may incur a silicon area
penalty for the additional "dirty" bits.

Use Caching to Reduce the Memory Traffic

The configuration manager may handle manual preloading of configurations.
Preloading may help in parallelizing the memory transfers with other computations
during the task switch. This cache can also reduce the memory traffic for frequent context
switches, provided that a Least Recently Used (LRU) replacement strategy is implemented in
addition to the preload mechanism.

The IRAMs can be defined to be local cache copies of main memory as discussed above
under the heading "Shared Memory (IRAM)." Then
each IRAM may be associated with a starting address and modification state information.
The IRAM memory cells may be replicated. An IRAM PAE may contain an
IRAM block with multiple IRAM instances. It may be that only the starting addresses
of the IRAMs have to be saved and restored as context. The starting addresses for the
IRAMs of the current configuration select the IRAM instances with identical addresses to be
used.

If no address tag of an IRAM instance matches the address of the newly loaded context, the
corresponding memory area may be loaded to an empty IRAM instance.

If no empty IRAM instance is available, a clean (unmodified) instance may be declared
empty (and hence it may be required for it to be reloaded later on).

If no clean IRAM instance is available, a modified (dirty) instance may be cleaned by
writing its data back to main memory. This may add a certain delay for the write back.

This delay can be avoided if a separate state machine (cache controller) tries to clean inactive
IRAM instances by using unused memory cycles to write back the IRAM instances' contents. 
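
The selection order described above (matching address tag, then an empty instance, then a
clean instance, then a dirty instance that is written back first) can be illustrated with a small
software model. This is only a sketch under stated assumptions: the type and function names
(iram_instance, select_instance, write_backs) are hypothetical and do not appear in the
specification, the write back is reduced to a counter, and the choice of the first dirty
instance is arbitrary.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of one IRAM block; all names are illustrative only. */
enum iram_state { IRAM_EMPTY, IRAM_CLEAN, IRAM_DIRTY };

struct iram_instance {
    uint32_t start_addr;    /* address tag of the cached memory area */
    enum iram_state state;  /* modification state information */
};

/* Counts write backs so that a caller can observe the extra delay. */
static int write_backs = 0;

/* Returns the index of the instance to use for start_addr.
 * *needs_load is set to 1 when the memory area still has to be loaded. */
static size_t select_instance(struct iram_instance *blk, size_t n,
                              uint32_t start_addr, int *needs_load)
{
    size_t i;

    assert(blk != NULL && n > 0 && needs_load != NULL);

    /* 1. An address tag matches: reuse the instance as it is. */
    for (i = 0; i < n; i++) {
        if (blk[i].state != IRAM_EMPTY && blk[i].start_addr == start_addr) {
            *needs_load = 0;
            return i;
        }
    }
    *needs_load = 1;

    /* 2. Otherwise use an empty instance. */
    for (i = 0; i < n; i++)
        if (blk[i].state == IRAM_EMPTY)
            goto claim;

    /* 3. Otherwise declare a clean (unmodified) instance empty. */
    for (i = 0; i < n; i++)
        if (blk[i].state == IRAM_CLEAN)
            goto claim;

    /* 4. Otherwise clean a dirty instance by writing it back first.
     *    All instances are dirty here; instance 0 is an arbitrary choice. */
    i = 0;
    write_backs++;

claim:
    blk[i].start_addr = start_addr;
    blk[i].state = IRAM_CLEAN;  /* a freshly loaded copy is unmodified */
    return i;
}
```

In this model only the last case costs a write back, which corresponds to the delay that a
background cleaning state machine would try to hide.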

Context Switches

Usually a processor is viewed as executing a single stream of instructions. But today's
multitasking operating systems support hundreds of tasks being executed on a single processor.
This is achieved by switching contexts, where all, or at least the most relevant parts, of the
processor state which belong to the current task (the task's context) are exchanged with the
state of another task that will be executed next.

There are three types of context switches: switching of virtual processors with simultaneous
multithreading (SMT, also known as HyperThreading), execution of an
Interrupt Service Routine (ISR), and a Task Switch.

SMT Virtual Processor Switch

This type of context switch may be executed without software interaction, totally in
hardware. Instructions of several instruction streams are merged into a single instruction
stream to increase instruction level parallelism and improve functional unit utilization.
Hence, the processor state cannot be stored to and reloaded from memory between
instructions from different instruction streams. For example, in an
instance of alternating instructions from two streams, hundreds to thousands
of cycles might be needed to write the processor state to memory and read in another state.

Hence hardware designers have to replicate the internal state for every virtual processor.
Every instruction may be executed within the context (on the state) of the virtual processor
whose program counter was used to fetch the instruction. By replicating the state, only the
multiplexers, which have to be inserted to select one of the different states, have to be
switched.

Thus the size of the state may also increase the silicon area needed to implement
SMT, so the size of the state may be crucial for many design decisions.

Interrupt Service Routine

This type of context switch may be handled partially by hardware and partially by software.
It may be required for all of the state modified by the ISR to be saved on entry and
it may be required for it to be restored on exit.

The part of the state which is destroyed by the jump to the ISR may be saved by
hardware (e.g., the program counter). It may be the ISR's responsibility to save and restore
the state of all other resources that are actually used within the ISR.

The more state information to be saved, the slower the interrupt response time may be
and the greater the performance impact may be if external events trigger interrupts at a
high rate.

The execution model of the instructions may also affect the tradeoff between short
interrupt latencies and maximum throughput. Throughput may be maximized if the
instructions in the pipeline are finished and the instructions of the ISR are chained. This may
adversely affect the interrupt latency. If, however, the instructions are abandoned
(pre-empted) in favor of a short interrupt latency, it may be required for them to be
fetched again later, which may affect throughput. The third possibility would be to
save the internal state of the instructions within the pipeline, but this may require too
much hardware effort. Usually this is not done.



Task Switch

This type of context switch may be executed totally in software. It may be required for
all of a task's context (state) to be saved to memory, and it may be required for the
context of the new task to be reloaded. Since tasks are usually allowed to use all of the
processor's resources to achieve top performance, it may be required to save and restore all of
the processor state. If the amount of state is excessive, it may be
required for the rate of context switches to be decreased by less frequent rescheduling, or
a severe throughput degradation may result, as most of the time may be spent in
saving and restoring task contexts. This in turn may increase the response time for
the tasks.

A Load Store Architecture

In an example embodiment of the present invention, an XPP integration may be
provided as an asynchronously pipelined functional unit for the RISC.
An explicitly preloaded cache may be provided for the IRAMs, on top of the memory
hierarchy existing within the RISC (as discussed
above under the heading "Shared Memory (IRAM)"). Additionally a de-centralized explicitly
preloaded configuration cache within the PAE array may be employed to support
preloading of configurations and fast switching between configurations.

Since the IRAM content is an explicitly preloaded memory area, a virtually unlimited number
of such IRAMs can be used. They may be identified by their memory address and their
size. The IRAM content may be explicitly preloaded by the application. Caching may
increase performance by reusing data from the memory hierarchy. The cached operation may
also eliminate the need for explicit store instructions; they may be handled
implicitly by cache write back operations but can also be forced to synchronize with the
RISC.

The pipeline stages of the XPP functional unit may be Load, Execute, and Write Back
(Store). The store may be executed delayed as a cache write back. The pipeline stages may
execute in an asynchronous fashion, thus hiding the variable delays from the cache preloads
and the PAE array.

The XPP functional unit may be decoupled from the RISC by a FIFO, which is fed with the
XPP instructions. At the head of this FIFO, the XPP PAE may consume and
execute the configurations and the preloaded IRAMs. Synchronization of the XPP
and the RISC may be done explicitly by a synchronization instruction.

Instructions 

Embodiments of the present invention may require certain instruction formats. Data types
may be specified using a C style prototype definition. The following are example instruction
formats which may be required, all of which execute asynchronously, except for an XPPSync
instruction, which can be used to force synchronization.

XPPPreloadConfig (void *ConfigurationStartAddress)

The configuration may be added to the preload FIFO to be loaded into the configuration
cache within the PAE array.

Note that speculative preloads are possible, since successive preload commands overwrite
the previous.

The parameter is a pointer register of the RISC pointer register file. The size is implicitly 
contained in the configuration. 






XPPPreload (int IRAM, void *StartAddress, int Size) 
XPPPreloadClean (int IRAM, void *StartAddress, int Size) 



This instruction may specify the contents of the IRAM for the next configuration
execution. In fact, the memory area may be added to the preload FIFO to be loaded into the
specified IRAM.

The first parameter may be the IRAM number. This may be an immediate (constant)
value.

The second parameter may be a pointer to the starting address. This parameter may be
provided in a pointer register of the RISC pointer register file.

The third parameter may be the size in units of 32 bit words. This may be an integer
value. It may reside in a general-purpose register of the RISC's integer register file.



The first variant may actually preload the data from memory.

The second variant may be for write-only accesses. It may skip the loading operation.
Thus, it may be that no cache misses can occur for this IRAM. Only the address and size are
defined. They are obviously needed for the write back operation of the IRAM cache.

Note that speculative preloads are possible, since successive preload commands to the same
IRAM overwrite each other (if no configuration is executed in between). Thus,
only the last preload command may be actually effective when the configuration is
executed.

XPPExecute ()

This instruction may execute the last preloaded configuration with the last preloaded
IRAM contents. Actually, a configuration start command may be issued to the FIFO. Then
the FIFO may be advanced. This may mean that further preload commands will
specify the next configuration or parameters for the next configuration.

Whenever a configuration finishes, the next one may be consumed from the head of the
FIFO, if its start command has already been issued.

XPPSync (void *StartAddress, int Size)

This instruction may force write back operations for all IRAMs that overlap the given
memory area. If overlapping IRAMs are still in use by a configuration or preloaded to be 
used, this operation will block. Giving an address of NULL (zero) and a size of MAX_INT 
(bigger than the actual memory), this instruction can also be used to wait until all issued 
configurations finish. 

A Basic Implementation

As shown in Fig. 9, the XPP core 102 may share a memory hierarchy with the RISC core 112
using a special cache controller 125-130.

Fig. 10 shows IRAM and configuration cache controller data structures and a usage
example (instructions).

The preload-FIFOs in Fig. 10 may contain the addresses and sizes for already
issued IRAM preloads, exposing them to the XPP cache controller. The FIFOs may
have to be duplicated for every virtual processor in an SMT environment. "Tag" is the
typical tag for a cache line containing starting address, size, and state (empty / clean / dirty /
in-use). The additional in-use state signals usage by the current configuration. The cache
controller cannot manipulate these IRAM instances.

The execute configuration command may advance all preload FIFOs, copying the
old state to the newly created entry. This way the following preloads may replace the
previously used IRAMs and configurations. If no preload is issued for an IRAM before the
configuration is executed, the preload of the previous configuration may be retained.
Therefore, it may be that it is not necessary to repeat identical preloads for an IRAM in
consecutive configurations.
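
The behavior of the preload FIFOs (a later preload to the same IRAM overwrites an earlier
one, and the execute command advances the FIFO while copying the old state to the new
entry) can be modeled in software as follows. This is a minimal sketch, not the hardware
design: the names model_preload and model_execute, the number of IRAMs, and the FIFO
depth are assumptions chosen for illustration, and stalling on a full FIFO is not modeled.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_IRAMS  4   /* assumed for illustration */
#define FIFO_DEPTH 4   /* assumed for illustration */

/* One preload request: memory area identified by address and size. */
struct preload {
    const void *start;
    int size;           /* in 32 bit words */
};

struct preload_fifo {
    struct preload entry[FIFO_DEPTH][NUM_IRAMS];
    int write;          /* entry currently being filled by preloads */
};

/* Models a preload: it only updates the entry at the write position, so a
 * later preload to the same IRAM overwrites an earlier, speculative one. */
static void model_preload(struct preload_fifo *f, int iram,
                          const void *start, int size)
{
    assert(f != NULL && iram >= 0 && iram < NUM_IRAMS);
    f->entry[f->write][iram].start = start;
    f->entry[f->write][iram].size = size;
}

/* Models the execute command: the FIFO is advanced and the old state is
 * copied to the newly created entry, so IRAMs without a new preload keep
 * the preload of the previous configuration.
 * Returns the index of the entry that was issued for execution. */
static int model_execute(struct preload_fifo *f)
{
    int issued, next, i;

    assert(f != NULL);
    issued = f->write;
    next = (issued + 1) % FIFO_DEPTH;
    for (i = 0; i < NUM_IRAMS; i++)
        f->entry[next][i] = f->entry[issued][i];
    f->write = next;
    return issued;
}
```

With this model, two preloads to IRAM 0 followed by an execute leave only the second
preload in the issued entry, and a subsequent execute without a new preload for IRAM 0
inherits that same memory area.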


Each configuration's execute command may have to be delayed (stalled) until all
necessary preloads are finished, either explicitly by the use of a synchronization command or
implicitly by the cache controller. Hence the cache controller (XPP Ld/St unit) 125 may
have to handle the synchronization and execute commands as well, actually starting the
configuration as soon as all data is ready. After the termination of the configuration, dirty
IRAMs may be written back to memory as soon as possible if their content is not reused
in the same IRAM. Therefore the XPP PAE array (XPP core 102) and the XPP cache
controller 125 can be seen as a single unit since they do not have different instruction
streams. Rather, the cache controller can be seen as the configuration fetch (CF),
operand fetch (OF) (IRAM preload) and write back (WB) stage of the XPP pipeline, also
triggering the execute stage (EX) (PAE array). Fig. 11 shows the asynchronous pipeline of
the XPP 100.

Due to the long latencies, and their non-predictability (cache misses, variable length
configurations), the stages can be overlapped several configurations wide using the
configuration and data preload FIFO (i.e., pipeline) for loose coupling. If a
configuration is executing and the data for the next has already been preloaded, the data for
the next but one configuration may be preloaded. These preloads can be speculative.
The amount of speculation may be the compiler's trade-off. The reasonable length of the
preload FIFO can be several configurations. It may be limited by diminishing returns,
algorithm properties, the compiler's ability to schedule preloads early and by silicon usage
due to the IRAM duplication factor, which may have to be at least as big as the FIFO
length. Due to this loosely coupled operation, the interlocking (to avoid data hazards
between IRAMs) cannot be done optimally by software (scheduling), but may have to be
enforced by hardware (hardware interlocking). Hence the XPP cache controller and the XPP
PAE array can be seen as separate but not totally independent functional units.


The XPP cache controller may have several tasks. These are depicted as states in
Fig. 12. State transitions may take place along the edges between states,
whenever the condition for the edge is true. As soon as the condition is not true any more,
the reverse state transition may take place. The activities for the states may be as
follows.

At the lowest priority, the XPP cache controller 125 may have to fulfill already issued
preload commands, while writing back dirty IRAMs as soon as possible.

As soon as a configuration finishes, the next configuration can be started. This is a more
urgent task than write backs or future preloads. To be able to do that, all associated yet
unsatisfied preloads may have to be finished first. Thus, they may be preloaded with the
high priority inherited from the execute state.

A preload in turn can be blocked by an overlapping in-use or dirty IRAM instance in a
different block or by the lack of empty IRAM instances in the target IRAM block. The
former can be resolved by waiting for the configuration to finish and / or by a write back. To
resolve the latter, the least recently used clean IRAM can be discarded, thus becoming empty.
If no empty or clean IRAM instance exists, a dirty one may have to be written back to the
memory hierarchy. It cannot occur that no empty, clean, or dirty IRAM instances exist, since
only one instance can be in-use and there should be more than one instance in an IRAM
block; otherwise, no caching effect is achieved.




In an SMT environment the load FIFOs may have to be replicated for every virtual processor.
The pipelines of the functional units may be fed from the shared fetch / reorder / issue
stage. All functional units may execute in parallel. Different units can execute instructions
of different virtual processors. Fig. 13 shows adding of simultaneous multithreading.

So the following design parameters, with their smallest initial value, may be obtained:

• IRAM length: 128 words

The longer the IRAM length, the longer the running time of the configuration and
the less influence the pipeline startup has.

• FIFO length: 1

This parameter may help to hide cache misses during preloading. The longer
the FIFO length, the less disruptive is a series of cache misses for a single
configuration.

• IRAM duplication factor: (pipeline stages + caching factor) * virtual processors: 3

Pipeline stages is the number of pipeline stages LD/EX/WB plus one for every
FIFO stage above one: 3
Caching factor is the number of IRAM duplicates available for caching: 0
Virtual processors is the number of virtual processors with SMT: 1

The size of the state of a virtual processor is mainly dependent on the FIFO length. It is:

FIFO length * #IRAM ports * (32 bit (Address) + 32 bit (Size))

This may have to be replicated for every virtual processor.

The total size of memory used for the IRAMs may be:

#IRAM ports * IRAM duplication factor * IRAM length * 32 bit

A first implementation will probably keep close to the above-stated minimum parameters,
using a FIFO length of one, an IRAM duplication factor of four, an IRAM length of 128 and
no simultaneous multithreading.
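
The two size formulas above can be evaluated with a few helper functions. The sketch below
is illustrative only: the function names are hypothetical, and the figure of 16 IRAM ports
used in the following note is an assumption (it is suggested by the 16 bit destination field
mentioned for the XPP 64 elsewhere in this description, but is not stated as a design
parameter here).

```c
#include <assert.h>
#include <stdint.h>

/* State of one virtual processor in bits:
 * FIFO length * #IRAM ports * (32 bit address + 32 bit size). */
static uint64_t vp_state_bits(uint32_t fifo_len, uint32_t iram_ports)
{
    assert(fifo_len > 0 && iram_ports > 0);
    return (uint64_t)fifo_len * iram_ports * (32u + 32u);
}

/* IRAM duplication factor:
 * (pipeline stages + caching factor) * virtual processors. */
static uint32_t duplication_factor(uint32_t pipeline_stages,
                                   uint32_t caching_factor,
                                   uint32_t virtual_processors)
{
    assert(pipeline_stages > 0 && virtual_processors > 0);
    return (pipeline_stages + caching_factor) * virtual_processors;
}

/* Total IRAM memory in bits:
 * #IRAM ports * IRAM duplication factor * IRAM length * 32 bit. */
static uint64_t iram_memory_bits(uint32_t iram_ports, uint32_t dup_factor,
                                 uint32_t iram_len_words)
{
    assert(iram_ports > 0 && dup_factor > 0 && iram_len_words > 0);
    return (uint64_t)iram_ports * dup_factor * iram_len_words * 32u;
}
```

Under the assumption of 16 IRAM ports, a FIFO length of one gives 1 * 16 * 64 = 1024 bits
(128 bytes) of state per virtual processor. With an IRAM length of 128 words, a duplication
factor of three gives 16 * 3 * 128 * 32 = 196608 bits (24 KiB) of IRAM memory, and a
duplication factor of four gives 262144 bits (32 KiB).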



Implementation Improvements

Write Pointer

To further decrease the penalty for unloaded IRAMs, a simple write pointer may be used per
IRAM, which may keep track of the last address already in the IRAM. Thus, no stall is
required, unless an access beyond this write pointer is encountered. This may be especially
useful if all IRAMs have to be reloaded after a task switch. The delay to the configuration
start can be much shorter, especially if the preload engine of the cache controller chooses the
blocking IRAM next whenever several IRAMs need further loading.

Longer FIFOs

The frequency at the bottom of the memory hierarchy (main memory) cannot be raised to the
same extent as the frequency of the CPU core. To increase the concurrency between the
RISC core 112 and the PACT XPP core 102, the prefetch FIFOs in the above-referenced drawing
can be extended. Thus, the IRAM contents for several configurations can be preloaded, like
the configurations themselves. A simple convention makes clear which IRAM preloads
belong to which configuration. The configuration execute switches to the next
configuration context. This can be accomplished by advancing the FIFO write pointer with
every configuration execute, while leaving it unchanged after every preload. Unassigned
IRAM FIFO entries may keep their contents from the previous configuration, so every
succeeding configuration may use the preceding configuration's IRAMx if no different
IRAMx was preloaded.

If none of the memory areas to be copied to IRAMs is in any cache, extending the FIFOs 
does not help, as the memory is the bottleneck. So the cache size should be adjusted together 
with the FIFO length. 

A drawback of extending the FIFO length is the increased likelihood that the IRAM content 
written by an earlier configuration is reused by a later one in another IRAM. A cache 
coherence protocol can clear the situation. Note, however, that the situation can be resolved 
more easily. If an overlap between any new IRAM area and a currently dirty IRAM
content of another IRAM bank is detected, the new IRAM is simply not loaded until the
write back of the changed IRAM has finished. Thus, the execution of the new configuration
may be delayed until the correct data is available.
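
The overlap check described above can be sketched as a simple interval test. The names
iram_area, areas_overlap and preload_must_wait are hypothetical, sizes are assumed to be
given in 32 bit words as for the preload instructions, and byte addressing is assumed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Memory area cached in an IRAM (illustrative model). */
struct iram_area {
    uintptr_t start;      /* starting byte address */
    uint32_t size_words;  /* size in 32 bit words */
    bool dirty;           /* modified and not yet written back */
};

/* Two half-open address ranges overlap if each starts before the other ends. */
static bool areas_overlap(const struct iram_area *a, const struct iram_area *b)
{
    uintptr_t a_end = a->start + (uintptr_t)a->size_words * 4u;
    uintptr_t b_end = b->start + (uintptr_t)b->size_words * 4u;
    return a->start < b_end && b->start < a_end;
}

/* Returns true if loading new_area has to wait for a write back, i.e. if it
 * overlaps the dirty content of any other IRAM. */
static bool preload_must_wait(const struct iram_area *new_area,
                              const struct iram_area *others, size_t n)
{
    size_t i;

    assert(new_area != NULL && (others != NULL || n == 0));
    for (i = 0; i < n; i++)
        if (others[i].dirty && areas_overlap(new_area, &others[i]))
            return true;
    return false;
}
```

An overlap with a clean IRAM does not delay the load in this model, since the memory
hierarchy already holds the same data.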

For a short (single entry) FIFO, an overlap is extremely unlikely, since the compiler will 
usually leave the output IRAM contents of the previous configuration in place for the next 
configuration to skip the preload. The compiler may do so using a coalescing algorithm
for the IRAMs / vector registers. The coalescing algorithm may be the same as used for
register coalescing in register allocation. 

Read Only IRAMs

Whenever the memory that is used by the executing configuration is the source of a preload
command for another IRAM, an XPP pipeline stall may occur. The preload can only
be started when the configuration has finished and, if the content was modified, the
memory content has been written to the cache. To decrease the number of pipeline stalls, it
may be beneficial to add an additional read-only IRAM state. If the IRAM is read only, the
content cannot be changed, and the preload of the data to the other IRAM can proceed
without delay. This may require an extension to the preload instructions. The
XppPreload and the XppPreloadClean instruction formats can be combined to a single
instruction format that has two additional bits stating whether the IRAM will be read and/or
written. To support debugging, violations should be checked at the IRAM ports, raising an
exception when needed.
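
One possible reading of the two additional bits is sketched below. The encoding and all
names (IRAM_ACC_READ, IRAM_ACC_WRITE, needs_load, and so on) are assumptions made for
illustration; the description does not define them. In particular, treating a read from a
write-only IRAM as a violation is an assumption of this sketch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical encoding of the two additional bits of a combined preload. */
enum iram_access {
    IRAM_ACC_READ  = 1 << 0,  /* the configuration will read the IRAM */
    IRAM_ACC_WRITE = 1 << 1   /* the configuration will write the IRAM */
};

/* Data has to be loaded only if it will be read; a write-only IRAM
 * corresponds to the "preload clean" variant. */
static bool needs_load(unsigned access)
{
    assert((access & ~(unsigned)(IRAM_ACC_READ | IRAM_ACC_WRITE)) == 0);
    return (access & IRAM_ACC_READ) != 0;
}

/* A write back can only become necessary if the IRAM may be written. */
static bool needs_write_back(unsigned access)
{
    return (access & IRAM_ACC_WRITE) != 0;
}

/* A read-only IRAM cannot change, so a preload of the same memory to
 * another IRAM may proceed while the configuration is still running. */
static bool source_may_proceed(unsigned access)
{
    return (access & IRAM_ACC_WRITE) == 0;
}

/* Debug check at the IRAM port: true if the access was not declared. */
static bool is_violation(unsigned access, bool is_write)
{
    unsigned needed = is_write ? IRAM_ACC_WRITE : IRAM_ACC_READ;
    return (access & needed) == 0;
}
```

In hardware, a true result of is_violation would correspond to raising the exception
mentioned above.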

Support for Data Distribution and Data Reorganization

The IRAMs may be block-oriented structures, which can be read in any order by the PAE
array. However, the address generation may add complexity, reducing the number of
PAEs available for the actual computation. Accordingly, the IRAMs may
be accessed in linear order. The memory hierarchy may be block oriented as well, further
encouraging linear access patterns in the code to avoid cache misses.

As the IRAM read ports limit the bandwidth between each IRAM and the PAE array to one 
word read per cycle, it can be beneficial to distribute the data over several IRAMs to remove 
this bottleneck. The top of the memory hierarchy is the source of the data, so the 
number of cache misses never increases when the access pattern is changed, as long
as the data locality is not destroyed. 

Many algorithms access memory in linear order by definition to utilize block reading and 
simple address calculations. In most other cases and in the cases where loop tiling is needed 
to increase the data bandwidth between the IRAMs and the PAE array, the code can be 
transformed in a way that data is accessed in optimal order. In many of the remaining cases, 
the compiler can modify the access pattern by data layout rearrangements (e.g., array
merging), so that finally the data is accessed in the desired pattern. If none of these
optimizations can be used because of dependencies, or because the data layout is fixed, there
are still two possibilities to improve performance, which are data duplication and data
reordering.

Data Duplication

Data may be duplicated in several IRAMs. This may circumvent the IRAM
read port bottleneck, allowing several data items to be read from the input every cycle.

Several options are possible with a common drawback. Data duplication can only be
applied to input data. Output IRAMs obviously cannot have overlapping address
ranges.

• Using several IRAM preload commands specifying just different target IRAMs:

This way cache misses may occur only for the first preload. All other preloads
may take place without cache misses. Only the time to transfer the data
from the top of the memory hierarchy to the IRAMs is needed for every additional
load. This is only beneficial if the cache misses plus the additional transfer times do
not exceed the execution time for the configuration.

• Using an IRAM preload instruction to load multiple IRAMs concurrently:

As identical data is needed in several IRAMs, they can be loaded concurrently by
writing the same values to all of them. This amounts to finding a clean IRAM
instance for every target IRAM, connecting them all to the bus, and writing the data to
the bus. The problem with this instruction may be that it requires a bigger
immediate field for the destination (16 bits instead of 4 for the XPP 64). Accordingly,
this instruction format may grow at a higher rate when the number of IRAMs is
increased for bigger XPP arrays.

The interface of this instruction is, for example:

XPPPreloadMultiple (int IRAMS, void *StartAddress, int Size)

This instruction may behave as the XPPPreload / XPPPreloadClean instructions with
the exception of the first parameter. The first parameter is IRAMS. This may be an
immediate (constant) value. The value may be a bitmap. For every bit set in the bitmap,
the IRAM with that number may be a target for the load operation.
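
The interpretation of the IRAMS bitmap can be illustrated as follows. This is a sketch under
the assumption of 16 IRAMs (matching the 16 bit field mentioned above for the XPP 64); the
function name decode_iram_bitmap is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_IRAMS 16  /* assumed: one bitmap bit per IRAM, as for the XPP 64 */

/* Writes the numbers of all IRAMs selected by the bitmap into targets[]
 * in ascending order and returns how many IRAMs are selected. */
static size_t decode_iram_bitmap(uint16_t irams, int targets[NUM_IRAMS])
{
    size_t count = 0;
    int n;

    assert(targets != NULL);
    for (n = 0; n < NUM_IRAMS; n++)
        if (irams & (1u << n))      /* bit n set: IRAM n is a load target */
            targets[count++] = n;
    return count;
}
```

For instance, a bitmap of 0x0005 selects IRAM 0 and IRAM 2 as targets of the same load.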

There is no "clean" version, since data duplication is applicable for read data only.

Data Reordering

Data reordering changes the access pattern to the data only. It does not change the amount of
memory that is read. Thus, the number of cache misses may stay the same.

• Adding additional functionality to the hardware:

• Adding a vector stride to the preload instruction.

A stride (displacement between two elements in memory) may be used in vector
load operations to load, e.g., a column of a matrix into a vector register.

This is still a linear access pattern. It can be implemented in hardware by giving a
stride to the preload instruction and adding the stride to the IRAM identification state.
One problem with this instruction may be that the number of possible cache misses
per IRAM load rises. In the worst case it can be one cache miss per loaded value if
the stride is equal to the cache line size and none of the data is in the cache. But as already
stated, the total number of misses stays the same. Just the distribution
changes. Still, this is an undesirable effect.

The other problem may be the complexity of the implementation and a possibly
limited throughput, as the data paths between the layers of the memory hierarchy are
optimized for block transfers. Transferring non-contiguous words will not use wide
busses in an optimal fashion.

The interface of the instruction is, for example:

XPPPreloadStride (int IRAM, void *StartAddress, int Size, int Stride)
XPPPreloadCleanStride (int IRAM, void *StartAddress, int Size, int Stride).

This instruction may behave as the XPPPreload / XPPPreloadClean
instructions with the addition of another parameter. The fourth parameter is the
vector stride. This may be an immediate (constant) value. It may tell the cache
controller to load only every n-th value to the specified IRAM.
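
The effect of a strided preload can be modeled in software as a gather of every n-th word.
The sketch below is illustrative only: model_preload_stride is a hypothetical name, the IRAM
is represented by a plain array, and Size is assumed to count the words written to the IRAM
rather than the span of memory that is touched.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Copies `size` words into the IRAM model, taking every `stride`-th word
 * of the source, starting with the first one. */
static void model_preload_stride(uint32_t *iram, const uint32_t *src,
                                 size_t size, size_t stride)
{
    size_t i;

    assert(iram != NULL && src != NULL && stride > 0);
    for (i = 0; i < size; i++)
        iram[i] = src[i * stride];
}
```

For a matrix stored row by row with four words per row, starting at the second word with a
stride of four loads the second column, which is the matrix column example given above.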

• Reordering the data at run time, introducing temporary copies.

• On the RISC:

The RISC can copy data at a maximum rate of one word per cycle for simple
computations and at a somewhat lower rate for more complex ones.

With a memory hierarchy, the sources may be read from memory (or cache, if
they were used recently) once and written to the temporary copy, which may then
reside in the cache, too. This may increase the pressure in the memory
hierarchy by the amount of memory used for the temporaries. Since temporaries are
allocated on the stack memory, which may be re-used frequently, the chances are
good that the dirty memory area is redefined before it is written back to
memory. Hence the write back operation to memory is of no concern.



• Via an XPP configuration:

The PAE array can read and write one value from every IRAM per cycle. Thus, if
half of the IRAMs are used as inputs and half of the IRAMs are used as outputs, up to
eight (or more, depending on the number of IRAMs) values can be reordered per
cycle, using the PAE array for address generation. As the inputs and outputs reside in
IRAMs, it does not matter if the reordering is done before or after the configuration
that uses the data. The IRAMs can be reused immediately.



IRAM Chaining 

If the PAEs do not allow further unrolling, but there are still IRAMs left unused, it may be
possible to load additional blocks of data into these IRAMs and chain two IRAMs
via an address selector. This might not increase throughput as much as unrolling
would do, but it still may help to hide long pipeline startup delays whenever unrolling is
not possible.

Software / Hardware Interface

According to the design parameter changes and the corresponding changes to the hardware,
according to embodiments of the present invention, the hardware / software
interface has changed. In the following, some prominent changes and their handling
are discussed.
Explicit Cache

The proposed cache is not a usual cache, which would be, without considering
performance issues, invisible to the programmer / compiler, as its operation is
transparent. The proposed cache is an explicit cache. Its state may have to be maintained
by software.

Cache Consistency and Pipelining of Preload / Configuration / Write back 

The software may be responsible for cache consistency. It may be possible to have
several IRAMs caching the same or overlapping memory areas. As long as only one of the
IRAMs is written, this is perfectly ok. Only this IRAM will be dirty and will be written
back to memory. If, however, more than one of the IRAMs is written, which data will be
written to memory is not defined. This is a software
bug (non-deterministic behavior).

As the execution of the configuration is overlapped with the preloads and write backs of the
IRAMs, it may be possible to create preload / configuration sequences that contain data
hazards. As the cache controller and the XPP array can be seen as separate functional units,
which are effectively pipelined, these data hazards are equivalent to pipeline hazards of a
normal instruction pipeline. As with any ordinary pipeline, there are two possibilities to
resolve this, which are hardware interlocking and software interlocking.



• Hardware interlocking: 

Interlocking may be done by the cache controller. If the cache controller detects that
the tag of a dirty or in-use item in IRAMx overlaps a memory area used for another
IRAM preload, it may have to stall that preload, effectively serializing the execution
of the current configuration and the preload.
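
The overlap test behind such an interlock may be sketched in software as follows. This is an illustrative model under stated assumptions, not the controller logic itself: the iram_tag structure and the function names are hypothetical, and each IRAM is assumed to cache one contiguous byte range.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical descriptor of the memory area cached by one IRAM. */
typedef struct {
    uintptr_t base;   /* first byte address of the cached area */
    uintptr_t size;   /* length of the area in bytes */
    bool dirty;       /* written by the array, not yet written back */
    bool in_use;      /* accessed by the currently running configuration */
} iram_tag;

/* True if the half-open ranges [a, a+alen) and [b, b+blen) intersect. */
static bool ranges_overlap(uintptr_t a, uintptr_t alen,
                           uintptr_t b, uintptr_t blen)
{
    return a < b + blen && b < a + alen;
}

/* Decides whether a preload of [base, base+size) has to be stalled because
 * it touches an area that is dirty or in use in some IRAM. */
static bool preload_must_stall(const iram_tag *tags, int count,
                               uintptr_t base, uintptr_t size)
{
    assert(count >= 0);
    for (int i = 0; i < count; i++) {
        if ((tags[i].dirty || tags[i].in_use) &&
            ranges_overlap(tags[i].base, tags[i].size, base, size)) {
            return true;
        }
    }
    return false;
}
```

A preload that overlaps only clean, unused areas is not stalled in this model, which corresponds to the case in which no data hazard exists.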



• Software interlocking: 

If the cache controller does not enforce interlocking, the code generator may have to
insert explicit synchronize instructions to take care of potential interlocks.
Inter-procedural and inter-modular alias and data dependency analyses can determine if this
is the case, while scheduling algorithms may help to alleviate the impact of the necessary
synchronization instructions.
In either case, as well as in the case of pipeline stalls due to cache misses, SMT can use the
computation power that would be wasted otherwise.

Code Generation for the Explicit Cache
Apart from the explicit synchronization instructions issued with software interlocking, the
following instructions may have to be issued by the compiler.

• Configuration preload instructions, preceding the IRAM preload instructions, that will be 
used by that configuration. These should be scheduled as early as possible by the 
instruction scheduler. 

• IRAM preload instructions, which should also be scheduled as early as possible by the 
instruction scheduler. 

• Configuration execute instructions, following the IRAM preload instructions for that 
configuration. These instructions should be scheduled between the estimated minimum 
and the estimated maximum of the cumulative latency of their preload instructions. 

• IRAM synchronization instructions, which should be scheduled as late as possible by the 
instruction scheduler. These instructions must be inserted before any potential access of 
the RISC to the data areas that are duplicated and potentially modified in the IRAMs. 
Typically, these instructions will follow a long chain of computations on the XPP, so they
will not significantly decrease performance. 

Asynchronicity to Other Functional Units

An XppSync() must be issued by the compiler, if an instruction of another functional unit
(mainly the Ld/St unit) can access a memory area that is potentially dirty or in-use in an
IRAM. This may force a synchronization of the instruction streams and the cache
contents, avoiding data hazards. A thorough inter-procedural and inter-modular array alias
analysis may limit the frequency of these synchronization instructions to an acceptable
level.
Another Implementation

For the previous design, the IRAMs are existent in silicon, duplicated several times to keep
the pipeline busy. This may amount to a large silicon area that is not fully
busy all the time, especially when the PAE array is not used, but as well whenever the
configuration does not use all of the IRAMs present in the array. The duplication may also
make it difficult to extend the lengths of the IRAMs, as the total size of the already
large IRAM area scales linearly.

For a more silicon efficient implementation, the IRAMs may be
integrated into the first level cache, making this cache bigger. This means that the
first level cache controller is extended to feed all IRAM ports of the PAE array.
This way, the XPP and the RISC may share the first level cache in a more efficient
manner. Whenever the XPP is executing, it may steal as much cache space as it needs
from the RISC. Whenever the RISC alone is running, it will have plenty of additional cache
space to improve performance.

The PAE array may have the ability to read one word and write one word to each IRAM
port every cycle. This can be limited to either a read or a write access per cycle, without
limiting programmability. If data has to be written to the same area in the same cycle,
another IRAM port can be used. This may increase the number of used IRAM
ports, but only under rare circumstances.

This leaves sixteen data accesses per PAE cycle in the worst case. Due to the worst case of
all sixteen memory areas for the sixteen IRAM ports mapping to the same associative bank,
the minimum associativity for the cache may be a 16-way set associativity. This may
avoid cache replacement for this rare, but possible, worst-case example.

Two factors may help to support sixteen accesses per PAE array cycle:
• The clock frequency of the PAE array generally has to be lower than for the RISC by a
factor of two to four. The reasons lie in the configurable routing channels with switch
matrices which cannot support as high a frequency as solid point-to-point
aluminum or copper traces.
This means that two to four IRAM port accesses can be handled serially by a single
cache port, as long as all reads are serviced before all writes, if there is a potential
overlap. This can be accomplished by assuming a potential overlap and enforcing a
priority ordering of all accesses, giving the read accesses higher priority.

• A factor of two, four, or eight is possible by accessing the cache as two, four, or eight
banks of lower associativity cache.

For a cycle divisor of four, four banks of four-way associativity will be optimal. During 
four successive cycles, four different accesses can be served by each bank of four way 
associativity. Up to four-way data duplication can be handled by using adjacent IRAM 
ports that are connected to the same bus (bank). For further data duplication, the data 
may have to be duplicated explicitly, using an XppPreloadMultiple() cache controller
instruction. The maximum data duplication for sixteen read accesses to the same memory
area is supported by an actual data duplication factor of four: one copy in each bank.
This does not affect the RAM efficiency as adversely as an actual data duplication of 16
for the embodiment discussed above under the heading "A Load Store Architecture."

Fig. 14 shows an example of a cache structure according to an example embodiment of the
present invention. The cache controller may run at the same speed as the RISC.
The XPP may run at a lower (e.g., quarter) speed. Accordingly, in the
worst case, sixteen read requests from the PAE array may be serviced in four cycles
of the cache controller, with an additional four read requests from the RISC.
Accordingly, one bus at full speed can be used to service four IRAM read ports. Using four-way
associativity, four accesses per cycle can be serviced, even in the case that all four
accesses go to addresses that map to the same associative block.

a) The RISC still has a 16-way set associative view of the cache, accessing all four four-
way set associative banks in parallel. Due to data duplication, it is possible that
several banks return a hit. This may be taken care of with a priority encoder,
enabling only one bank onto the data bus.

b) The RISC is blocked from the banks that service IRAM port accesses. Wait states are
inserted accordingly.

c) The RISC shares the second cache access port of a two-port cache with the RAM 
interface, using the cycles between the RAM transfers for its accesses. 

d) The cache is extended by a fifth 4-way set associative bank, used exclusively by the
RISC. (The other banks are only accessed when they are not used by the current XPP
configuration. PROBLEM: dirty line in a blocked bank).

With respect to a 2 port RAM, concurrent reads may be accommodated.
Concurrent R/W to a same cache line may be avoided by software synchronization /
hardware arbiter.

A problem is that a read could potentially address the same memory location as a
write. The value read may depend on the order of the operation, so that the
order is fixed: all writes have to take place after all reads, but before the reads of the next
cycle, except if the reads and writes actually do not overlap. This can only be a problem
with data duplication, when only one copy of the data is actually modified. Therefore,
modifications are forbidden with data duplication.

Programming Model Changes
Data Interference

According to an example embodiment of the present invention that is without
dedicated IRAMs, it is not possible anymore to load input data to the IRAMs
and write the output data to a different IRAM, which is mapped to the same address, thus
operating on the original, unaltered input data during the whole configuration.

As there are no dedicated IRAMs anymore, writes directly modify the cache
contents, which will be read by succeeding reads. This changes the programming model
significantly. Additional and more in-depth compiler analyses are accordingly
necessary.
Hiding Implementation Details

The actual number of bits in the destination field of the XppPreloadMultiple instruction is
implementation dependent. It depends on the number of cache banks and their associativity,
which are determined by the clock frequency divisor of the XPP PAE array relative to the
cache frequency. However, this can be hidden by the assembler, which may
translate IRAM ports to cache banks, thus reducing the number of bits from the number of
IRAM ports to the number of banks. For the user, it is sufficient to know that each cache
bank services an adjacent set of IRAM ports starting at a power of two. Thus, it may be
best to use data duplication for adjacent ports, starting with the highest power of two
greater than the number of read ports to the duplicated area.
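
The translation from IRAM ports to cache banks that an assembler may perform can be sketched as a mask conversion. The sketch below is illustrative only: the function name ports_to_banks and the mask encoding are assumptions, and it is assumed that a fixed number of adjacent ports (e.g., four of sixteen) share one bank.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical translation of an IRAM port mask (bit p set = port p is a
 * destination) into a cache bank mask, assuming `ports_per_bank` adjacent
 * ports are serviced by the same bank. */
static uint32_t ports_to_banks(uint32_t port_mask, unsigned num_ports,
                               unsigned ports_per_bank)
{
    assert(ports_per_bank > 0 && num_ports % ports_per_bank == 0);
    assert(num_ports <= 32);
    uint32_t bank_mask = 0;
    for (unsigned port = 0; port < num_ports; port++) {
        if (port_mask & (UINT32_C(1) << port)) {
            bank_mask |= UINT32_C(1) << (port / ports_per_bank);
        }
    }
    return bank_mask;
}
```

With sixteen ports and four ports per bank, the sixteen destination bits reduce to four bank bits, which is the reduction in field width described above.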

3 Program Optimizations 

Code Analysis

Analyses may be performed on programs to describe the relationships
between data and memory location in a program. These analyses may then be used by
different optimizations. More details regarding the analyses are discussed in Michael Wolfe,
"High Performance Compilers for Parallel Computing" (Addison-Wesley 1996); Hans Zima &
Barbara Chapman, "Supercompilers for parallel and vector computers" (Addison-Wesley 1991);
and Steven Muchnick, "Advanced Compiler Design and Implementation" (Morgan Kaufmann
1997).



Data-Flow Analysis

Data-flow analysis examines the flow of scalar values through a program to provide
information about how the program manipulates its data. This information can be
represented by dataflow equations that have the following general form for an object i, which can
be an instruction or a basic block, depending on the problem to solve:

$Ex[i] = Prod[i] \cup (In[i] - Supp[i])$

This means that data available at the end of the execution of object i, Ex[i], are either
produced by i, Prod[i], or were alive at the beginning of i, In[i], but were not deleted during
the execution of i, Supp[i].
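
The general form of the dataflow equation may be expressed compactly with bit sets. The following is a minimal sketch, assuming that each bit stands for one datum (e.g., one definition) and that at most 32 data are tracked; the type and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Each bit stands for one datum (e.g., one definition of a variable). */
typedef uint32_t dataset;

/* Ex[i] = Prod[i] union (In[i] minus Supp[i]) for a single object i. */
static dataset flow_out(dataset prod, dataset in, dataset supp)
{
    return prod | (in & ~supp);
}

/* Propagates a set through `n` objects executed one after another, so that
 * the Ex set of one object becomes the In set of the next. */
static dataset flow_through(const dataset *prod, const dataset *supp,
                            int n, dataset in)
{
    assert(n >= 0);
    for (int i = 0; i < n; i++) {
        in = flow_out(prod[i], in, supp[i]);
    }
    return in;
}
```

An object that neither produces nor deletes anything passes its input set through unchanged in this model.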

These equations can be used to solve several problems, such as, e.g.,
• the problem of reaching definitions;
• the Def-Use and Use-Def chains, describing respectively, for a definition, all
uses that can be reached from it, and, for a use, all definitions that can reach it;
• the available expressions at a point in the program; and/or
• the live variables at a point in the program,
whose solutions are then used by several compilation phases, analysis, or optimizations.

For example, with respect to a problem of computing the Def-Use chains
of the variables of a program, this information can be used for instance by the data
dependence analysis for scalar variables or by the register allocation. A Def-Use chain is
associated to each definition of a variable and is the set of all visible uses from this definition.
The data-flow equations presented above may be applied to the basic blocks to detect the
variables that are passed from one block to another along the control-flow graph. In Fig. 15,
which shows a control-flow graph of a piece of a program, two
definitions for variable x are produced: S1 in B1 and S4 in B3. Hence, the variable that can
be found at the exit of B1 is Ex(B1) = {x(S1)}; and at the exit of B4 is Ex(B4) = {x(S4)}.
Moreover, Ex(B2) = Ex(B1) as no variable is defined in B2. Using these sets,
it is the case that the uses of x in S2 and S3 depend on the definition of x in B1 and that
the use of x in S5 depends on the definitions of x in B1 and B3. The Def-use chains
associated with the definitions are then D(S1) = {S2, S3, S5} and D(S4) = {S5}.

Figure 7: Control flow graph of a piece of program

Data Dependence Analysis
A data dependence graph represents the dependencies existing between
operations writing or reading the same data. This graph may be used for optimizations like
scheduling, or certain loop optimizations to test their semantic validity. The nodes of the
graph represent the instructions, and the edges represent the data dependencies.
These dependencies can be of three types: true (or flow) dependence when a
variable is written before being read, anti-dependence when a variable is read before being
written, and output dependence when a variable is written twice. A more formal
definition is provided in Hans Zima et al., supra and is presented below.
Definition

Let S and S' be two statements. Then S' depends on S, noted $S \,\delta\, S'$, iff:
(1) S is executed before S';
(2) $\exists v \in VAR: v \in DEF(S) \cap USE(S') \lor v \in USE(S) \cap DEF(S') \lor v \in DEF(S) \cap DEF(S')$; and
(3) there is no statement T such that S is executed before T and T is executed before S',
and $v \in DEF(T)$,

where VAR is the set of the variables of the program, DEF(S) is the set of the variables
defined by instruction S, and USE(S) is the set of variables used by instruction S.

Moreover, if the statements are in a loop, a dependence can be loop-independent or
loop-carried. This notion introduces the definition of the distance of a dependence. When a
dependence is loop-independent, it occurs between two instances of different
statements in the same iteration, and its distance is equal to 0. By
contrast, when a dependence is loop-carried, it occurs between two instances in two different
iterations, and its distance is equal to the difference
between the iteration numbers of the two instances.

The notion of direction of dependence generalizes the notion of distance, and is generally 
used when the distance of a dependence is not constant, or cannot be computed with 
precision. The direction of a dependence is given by <, if the dependence between S and S'
occurs when the instance of S is in an iteration before the iteration of the instance of S', = if
the two instances are in the same iteration, and > if the instance of S is in an iteration after the
iteration of the instance of S'.
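
The relationship between distance and direction may be illustrated as follows. This is a minimal sketch, assuming that the distance is measured as the iteration number of the instance of S' minus the iteration number of the instance of S; the function names are hypothetical.

```c
#include <assert.h>

/* Direction symbol for one loop level, derived from the distance
 * (iteration of the S' instance minus iteration of the S instance). */
static char direction_of(long distance)
{
    if (distance > 0) return '<';   /* S is in an earlier iteration than S' */
    if (distance < 0) return '>';   /* S is in a later iteration than S' */
    return '=';                     /* both instances in the same iteration */
}

/* Fills `dir` with one direction symbol per level of a loop nest. */
static void direction_vector(const long *dist, char *dir, int levels)
{
    assert(levels >= 0);
    for (int i = 0; i < levels; i++) {
        dir[i] = direction_of(dist[i]);
    }
}
```

A distance vector of (0, 2), for instance, corresponds to the direction vector (=, <) in this model.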

In the case of a loop nest, there are distance and direction vectors, with one
element for each level of the loop nest. Figs. 16 to 20 illustrate these definitions.
Fig. 16 illustrates a code and diagram of an
example of a true dependence with distance 0 on array 'a'. Fig. 17 illustrates a code and
diagram of an example of an anti-dependence with distance 0 on array 'b'. Fig. 18 illustrates
a code and diagram of an example of an output dependence with distance 0 on array 'a'. Fig.
19 illustrates a code and diagram of an example of a dependence with direction vector (=, =)
between S1 and S2 and a dependence with direction vector (=, =, <) between S2 and S2. Fig.
20 illustrates a code and diagram of an example of an anti-dependence with distance vector
(0,2).

The data dependence graph may be used by a lot of optimizations, and may also be useful
to determine if their application is valid. For instance, a loop can be vectorized if its data
dependence graph does not contain any cycle.
for (i=0; i<N; i=i+1) {
    S:  a[i] = b[i] + 1;
    S': c[i] = a[i] + 2;
}

Figure 8: Example of a true dependence with distance 0 on array a

for (i=0; i<N; i=i+1) {
    S:  a[i] = b[i] + 2;
    S': b[i] = c[i] + 2;
}

Figure 9: Example of an anti-dependence with distance 0 on array b

for (i=0; i<N; i=i+1) {
    S:  a[i] = b[i] + 1;
    S': a[i] = c[i] + 2;
}

Figure 10: Example of an output dependence with distance 0 on array a

i 

5 c[i] G] - 0; 

for Oc-0; k^-M; k-t-^) 

Sa= c[i] D]-c[i][i]- i a[i][k]*b[k]D]; 

Figure 11: Example of a dependence with direction vector (=, =) between S1 and S2 and a
dependence with direction vector (=, =, <) between S2 and S2.

for (i=0; i<=N; i++)
    for (j=0; j<=N; j++)

a[i] D] - L) '^3 + 

Figure 12: Example of an anti-dependence with distance vector (0,2).

Interprocedural Alias Analysis

An aim of alias analysis is to determine if a memory location is aliased by several
objects, e.g., variables or arrays, in a program. It may have a strong impact on data
dependence analysis and on the application of code optimizations. Aliases can occur with
statically allocated data, like unions in C where all fields refer to the same memory area, or
with dynamically allocated data, which are the usual targets of the analysis. A typical case
of aliasing where p alias b is:
int b[100], *p;
for (p=b; p < &b[100]; p++)
    *p=0;

Figure 13: Example for typical aliasing 



Alias analysis can be more or less precise depending on whether or not it takes the control-
flow into account. When it does, it is called flow-sensitive, and when it does not, it is called
flow-insensitive. Flow-sensitive alias analysis is able to detect in which blocks along a path
two objects are aliased. As it is more precise, it is more complicated and more expensive to
compute. Usually flow-insensitive alias information is sufficient. This aspect is illustrated
in Fig. 21 where a flow-insensitive analysis would find that p alias b, but where a
flow-sensitive analysis would be able to find that p alias b only in block B2.

Furthermore, aliases are classified into must-aliases and may-aliases. For instance,
considering flow-insensitive may-alias information, x alias y, iff x and y may,
possibly at different times, refer to the same memory location. Considering
flow-insensitive must-alias information, x alias y, iff x and y must, throughout
the execution of a procedure, refer to the same storage location. In the case of Fig.
21, if flow-insensitive may-alias information is considered, p alias b holds,
whereas if flow-insensitive must-alias information is considered, p alias b does
not hold. The kind of information to use depends on the problem to solve. For instance, if
removal of redundant expressions or statements is desired, must-aliases
must be used, whereas if a data dependence graph is desired, may-aliases
are necessary.

Figure 14: Example of control flow sensitivity

Finally, this analysis must be interprocedural to be able to detect aliases caused by non-local
variables and parameter passing. The latter case is depicted in the code below,
which is an example for aliasing by parameter passing, where i and j are aliased through the
function call where k is passed twice as parameter.
void foo (int *i, int *j)
{
    *i = *j+1;
}

foo (&k, &k);
Figure 15: Example for aliasing by parameter passing

Interprocedural Value Range Analysis

This analysis can find the range of values taken by the variables. It can help to apply
optimizations like dead code elimination, loop unrolling and others. For this purpose, it can
use information on the types of variables and then consider operations applied on these
variables during the execution of the program. Thus, it can determine, for instance, if tests in
conditional instructions are likely to be met or not, or determine the iteration range of loop
nests.

This analysis has to be interprocedural as, for instance, loop bounds can be passed as
parameters of a function, as in the following example. It is known by analyzing
the code that in the loop executed with array 'a', N is at least equal to 11, and that in the loop
executed with array 'b', N is at most equal to 10.

void foo (int *c, int N) 

{ 

int i ; 

for (i=0; i<N; i++)
c [i] = g (i, 2) ; 

} 

if (N > 10) 

foo (a, N) ; 

else 

foo (b,N) ; 
The value range analysis can be supported by the programmer by giving further value
constraints which cannot be retrieved from the language semantics. This can be done by 
pragmas or a compiler known assert function. 

Alignment Analysis

Alignment analysis deals with data layout for distributed memory architectures. As stated by
Saman Amarasinghe, "Although data memory is logically a linear array of cells, its
realization in hardware can be viewed as a multi-dimensional array. Given a dimension in
this array, alignment analysis will identify memory locations that always resolve to a single
value in that dimension. For example, if the dimension of interest is memory banks,
alignment analysis will identify if a memory reference always accesses the same bank."
This is the case in the second part of Fig. 22, which is a reproduction of a figure that can be
found in Sam Larsen, Emmett Witchel & Saman Amarasinghe, "Increasing and Detecting Memory
Address Congruence," Proceedings of the 2002 IEEE International Conference on Parallel
Architectures and Compilation Techniques (PACT'02), 18-29 (September 2002). All
accesses, depicted in dark squares, occur to the same memory bank, whereas in the first part,
the accesses are not aligned. Saman Amarasinghe adds that "Alignment information
is useful in a variety of compiler-controlled memory optimizations leading to improvements
in programmability, performance, and energy consumption."

Alignment analysis, for instance, is able to help find a good distribution scheme of the data 
and is furthermore useful for automatic data distribution tools. An automatic alignment 
analysis tool may be able to automatically generate alignment proposals for the arrays
accessed in a procedure and thus simplify the data distribution problem. This can be
extended with an interprocedural analysis taking into account dynamic realignment. 

Alignment analysis can also be used to apply loop alignment that transforms the code directly
rather than the data layout in itself, as discussed below. Another solution can be
used for the PACT XPP, relying on the fact that it can handle aligned code very efficiently.
It includes adding a conditional instruction testing if the accesses in the loop body
are aligned, followed by the necessary number of peeled iterations of the loop body, then the
aligned loop body, and then some compensation code. Only the aligned code is then
executed by the PACT XPP. The rest may be executed by the host processor. If the
alignment analysis is more precise (inter-procedural or inter-modular), less conditional
code has to be inserted.
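
The structure of peeled iterations, aligned loop body, and compensation code may be sketched as follows. This is an illustration under stated assumptions only: the alignment unit of four 32-bit words and the function name add_one_peeled are hypothetical, and all three parts run on the same processor here, whereas in the scheme described above only the aligned part would be executed by the array.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { BLOCK_WORDS = 4 };  /* assumed alignment unit: four 32-bit words */

/* Adds 1 to every element of a[0..n-1], split into three parts. */
static void add_one_peeled(uint32_t *a, size_t n)
{
    assert(a != NULL || n == 0);
    size_t i = 0;

    /* Peeled iterations: run until a[i] sits on a block boundary. */
    while (i < n && ((uintptr_t)&a[i] % (BLOCK_WORDS * sizeof a[0])) != 0) {
        a[i] += 1;
        i++;
    }
    /* Aligned loop body: whole blocks only. */
    for (; i + BLOCK_WORDS <= n; i += BLOCK_WORDS) {
        for (size_t j = 0; j < BLOCK_WORDS; j++) {
            a[i + j] += 1;
        }
    }
    /* Compensation code: iterations left after the last whole block. */
    for (; i < n; i++) {
        a[i] += 1;
    }
}
```

If the alignment of the array is already known at compile time, the test in the first loop may be resolved statically and the peeled part may disappear.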






Code Optimizations

Many of the optimizations and transformations discussed below are described in detail in
David F. Bacon, Susan L. Graham & Oliver J. Sharp, "Compiler Transformations for
High-Performance Computing," ACM Computing Surveys, 26(4):325-420 (1994); Michael Wolfe,
supra; Hans Zima et al., supra; and Steven Muchnick, supra.

General Transformations

Discussed below are a few general optimizations that can be
applied to straightforward code and to loop bodies. These are not the only ones that appear
in a compiler.

Constant Propagation 

A constant propagation may propagate the values of constants
into the expressions using them throughout the program. This way, a lot of computations can
be done statically by the compiler, leaving less work to be done during the execution.
This part of the optimization is also known as constant folding.

An example of constant propagation is:
N = 256;
c = 3;

for (i=0; i<=N; i++)
    a[i] = b[i] + c;



for (i=0; i<=256; i++)
    a[i] = b[i] + 3;

Figure 16: Example of constant propagation
Copy Propagation

A copy propagation optimization may simplify the code by removing
redundant copies of the same variable in the code. These copies can be produced by the
programmer or by other optimizations. This optimization may reduce
register pressure and the number of register-to-register move instructions.

An example of copy propagation is:

t = i*4;                          t = i*4;
r = t;                            for (i=0; i<=N; i++)
for (i=0; i<=N; i++)                  a[t] = b[t] + a[i];
    a[r] = b[r] + a[i];

Figure 17: Example of copy propagation



Dead Code Elimination 

A dead code elimination optimization may remove pieces of code that will
never be executed. Code is never executed if it is in the branch of a conditional statement
whose condition is always evaluated to true or false, or if it is a loop body whose number of
iterations is always equal to 0.

Code updating variables that are never used is also useless and can be removed as well. If a
variable is never used, then the code updating it and its declaration can also be eliminated.

An example of dead code elimination is:

for (i=0; i<=N; i++) {            for (i=0; i<=N; i++) {
    for (j=0; j<0; j++)               for (j=0; j<10; j++)
        a[j] = b[j] + a[i];               a[j+1] = a[j] + b[j];
    for (j=0; j<10; j++)          }
        a[j+1] = a[j] + b[j];
}

Figure 18: Example of dead code elimination



Forward Substitution 

A forward substitution optimization is a generalization of copy propagation. The use of
a variable may be replaced by its defining expression. It can be used for simplifying the
data dependency analysis and the application of other transformations by making the use of
loop variables visible.

An example of forward substitution is:

c = N + 1;                        for (i=0; i<=N; i++)
for (i=0; i<=N; i++)                  a[N+1] = b[N+1] + a[i];
    a[c] = b[c] + a[i];

Figure 19: Example of forward substitution



Idiom Recognition 

An idiom recognition transformation may recognize pieces of code and can
replace them by calls to compiler known functions, or less expensive code sequences, like
code for absolute value computation.

An example of idiom recognition is:

for (i=0; i<N; i++) {             for (i=0; i<N; i++) {
    c = a[i] - b[i];                  c = a[i] - b[i];
    if (c<0)                          c = abs(c);
        c = -c;                       d[i] = c;
    d[i] = c;                     }
}

Figure 20: Example of idiom recognition



Loop Transformations

Loop Normalization

A loop normalization transformation may ensure that the iteration space of the
loop is always with a lower bound equal to 0 or 1 (depending on the input language), and
with a step of 1. The array subscript expressions and the bounds of the loops are modified
accordingly. It can be used before loop fusion to find opportunities, and ease inter-loop
dependence analysis, and it also enables the use of dependence tests that need a
normalized loop to be applied.

An example of loop normalization is:

for (i=2; i<N; i=i+2)             for (i=0; i<(N-2)/2; i++)
    a[i] = b[i];                      a[2*i+2] = b[2*i+2];

Figure 21: Example of loop normalization
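
The effect of loop normalization on bounds and subscripts may be checked with a small sketch such as the following. The function names are hypothetical, and the trip count is computed here with rounding up, an assumption made in this sketch so that odd upper bounds are covered as well.

```c
#include <assert.h>

/* Loop with lower bound 2 and step 2. */
static void copy_original(int *a, const int *b, int n)
{
    for (int i = 2; i < n; i = i + 2) {
        a[i] = b[i];
    }
}

/* Same loop with lower bound 0 and step 1. The subscripts are rewritten as
 * 2*i+2 and the number of iterations is (n-2)/2 rounded up. */
static void copy_normalized(int *a, const int *b, int n)
{
    assert(n >= 0);
    int trips = (n > 2) ? (n - 2 + 1) / 2 : 0;
    for (int i = 0; i < trips; i++) {
        a[2 * i + 2] = b[2 * i + 2];
    }
}
```

Both forms touch exactly the same elements, so a dependence test written for normalized loops may be applied to the second form.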



Loop Reversal

A loop reversal transformation may change the direction in which the iteration
space of a loop is scanned. It is usually used in conjunction with loop normalization and
other transformations, like loop interchange, because it changes the dependence vectors.

An example of loop reversal is:

for (i=N; i>=0; i--)              for (i=0; i<=N; i++)
    a[i] = b[i];                      a[i] = b[i];

Figure 22: Example of loop reversal



Strength Reduction

A strength reduction transformation may replace expressions in the loop body by
equivalent but less expensive ones. It can be used on induction variables, other than the loop
variable, to be able to eliminate them.

An example of strength reduction is:

for (i=0; i<N; i++)                     t = c;
    a[i] = b[i] + c*i;                  for (i=0; i<N; i++) {
                                            a[i] = b[i] + t;
                                            t = t + c;
                                        }

Figure 23: Example of strength reduction
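
A minimal C sketch of the same idea, assuming a zero-based loop: the running term `t` is initialised to 0 here so that it equals `c*i` at the top of every iteration. The function names are hypothetical and only serve the comparison.

```c
#include <assert.h>

/* Direct form: one multiplication per iteration. */
void fill_mul(int *a, const int *b, int n, int c)
{
    for (int i = 0; i < n; i++)
        a[i] = b[i] + c * i;
}

/* Strength-reduced form: the product c*i is carried in t and
 * updated by an addition. t starts at 0 because the loop is
 * zero-based (assumption of this sketch). */
void fill_add(int *a, const int *b, int n, int c)
{
    int t = 0;
    for (int i = 0; i < n; i++) {
        a[i] = b[i] + t;
        t = t + c;
    }
}

/* Returns 1 if both forms produce the same array. */
int strength_reduction_matches(int n, int c)
{
    enum { MAX = 32 };
    int a1[MAX], a2[MAX], b[MAX];
    assert(n >= 0 && n <= MAX);
    for (int i = 0; i < n; i++)
        b[i] = 7 * i - 3;
    fill_mul(a1, b, n, c);
    fill_add(a2, b, n, c);
    for (int i = 0; i < n; i++)
        if (a1[i] != a2[i])
            return 0;
    return 1;
}
```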



Induction Variable Elimination 

An induction variable elimination transformation can use strength reduction to remove
induction variables from a loop, hence reducing the number of computations and easing the
analysis of the loop. This may also remove dependence cycles due to the update of
the variable, enabling vectorization.

An example of induction variable elimination is:

for (i=0; i<=N; i++) {                  for (i=0; i<=N; i++) {
    k = k + 3;                              a[i] = b[i] + a[k+(i+1)*3];
    a[i] = b[i] + a[k];                 }
}                                       k = k + (N+1)*3;

Figure 24: Example of induction variable elimination
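
The equivalence of the two forms can be checked with a small C sketch. The function names and the array length `LEN` are assumptions made for illustration; the loop bodies follow the example above.

```c
#include <assert.h>

enum { LEN = 64 };

/* Original: k is updated in every iteration (loop-carried update). */
int with_induction(int *a, const int *b, int n, int k)
{
    for (int i = 0; i <= n; i++) {
        k = k + 3;
        a[i] = b[i] + a[k];
    }
    return k;
}

/* Transformed: k is expressed as a function of i and updated once. */
int without_induction(int *a, const int *b, int n, int k)
{
    for (int i = 0; i <= n; i++)
        a[i] = b[i] + a[k + (i + 1) * 3];
    return k + (n + 1) * 3;
}

/* Returns 1 if both forms leave the same array and the same final k. */
int induction_elimination_matches(int n, int k)
{
    int a1[LEN], a2[LEN], b[LEN];
    assert(n >= 0 && k >= 0 && k + (n + 1) * 3 < LEN);
    for (int i = 0; i < LEN; i++) {
        a1[i] = a2[i] = i * i;
        b[i] = 2 * i + 1;
    }
    if (with_induction(a1, b, n, k) != without_induction(a2, b, n, k))
        return 0;
    for (int i = 0; i < LEN; i++)
        if (a1[i] != a2[i])
            return 0;
    return 1;
}
```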



Loop-Invariant Code Motion

A loop-invariant code motion transformation may move computations outside a
loop if their result is the same in all iterations. This may allow a reduction of
the number of computations in the loop body. This optimization can also be conducted in the
reverse fashion in order to get perfectly nested loops, which are easier to handle by other
optimizations.

An example of loop-invariant code motion is:

for (i=0; i<N; i++)                     if (N >= 0)
    a[i] = b[i] + x*y;                      c = x*y;
                                        for (i=0; i<N; i++)
                                            a[i] = b[i] + c;

Figure 25: Example of loop-invariant code motion



Loop Unswitching 

A loop unswitching transformation may move a conditional instruction outside of
a loop body if its condition is loop-invariant. The branches of the condition may then be
made of the original loop with the appropriate original statements of the conditional
statement. It may allow further parallelization of the loop by removing control-flow in
the loop body and also removing unnecessary computations from it.

An example of loop unswitching is:

for (i=0; i<N; i++) {                   if (x > 2)
    a[i] = b[i] + 3;                        for (i=0; i<N; i++) {
    if (x > 2)                                  a[i] = b[i] + 3;
        b[i] = c[i] + 2;                        b[i] = c[i] + 2;
    else                                    }
        b[i] = c[i] - 2;                else
}                                           for (i=0; i<N; i++) {
                                                a[i] = b[i] + 3;
                                                b[i] = c[i] - 2;
                                            }

Figure 26: Example of loop unswitching



If-Conversion 

An if-conversion transformation may be applied on loop bodies with conditional
instructions. It may change control dependencies into data
dependencies and then allow vectorization to take place. It can be used in
conjunction with loop unswitching to handle loop bodies with several basic
blocks. The conditions, where array expressions could appear, may be replaced by
boolean terms called guards. Processors with predicated execution support can
execute such code directly.

An example of if-conversion is:

for (i=0; i<N; i++) {                   for (i=0; i<N; i++) {
    a[i] = a[i] + b[i];                     a[i] = a[i] + b[i];
    if (a[i] != 0)                          c2 = (a[i] != 0);
        if (a[i] > c[i])                    if (c2) c4 = (a[i] > c[i]);
            a[i] = a[i] - 2;                if (c2 && c4) a[i] = a[i] - 2;
        else                                if (c2 && !c4) a[i] = a[i] + 1;
            a[i] = a[i] + 1;                d[i] = a[i] * 2;
    d[i] = a[i] * 2;                    }
}

Figure 27: Example of if-conversion
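
As a hedged illustration, the guarded form can be written in C with select expressions standing in for predicated instructions. The function names and the test data are assumptions of this sketch; the guards `c2` and `c4` follow the example above.

```c
#include <assert.h>

/* Control-flow form of the loop body. */
void body_branches(int *a, const int *b, const int *c, int *d, int n)
{
    for (int i = 0; i < n; i++) {
        a[i] = a[i] + b[i];
        if (a[i] != 0) {
            if (a[i] > c[i])
                a[i] = a[i] - 2;
            else
                a[i] = a[i] + 1;
        }
        d[i] = a[i] * 2;
    }
}

/* If-converted form: the conditions become guards (c2, c4) that are
 * computed once, and every update is a guarded select, so the loop
 * body has no nested branches. */
void body_guards(int *a, const int *b, const int *c, int *d, int n)
{
    assert(n >= 0);
    for (int i = 0; i < n; i++) {
        a[i] = a[i] + b[i];
        int c2 = (a[i] != 0);
        int c4 = c2 && (a[i] > c[i]);
        a[i] = (c2 && c4) ? a[i] - 2 : a[i];
        a[i] = (c2 && !c4) ? a[i] + 1 : a[i];
        d[i] = a[i] * 2;
    }
}

/* Returns 1 if both forms agree on data that reaches every branch. */
int if_conversion_matches(void)
{
    enum { N = 6 };
    const int b[N] = { 0, 1, -1, 5, -5, 2 };
    const int c[N] = { 0, 3, 0, 2, 2, 4 };
    int a1[N] = { 0, 1, 1, 1, 1, 2 };
    int a2[N] = { 0, 1, 1, 1, 1, 2 };
    int d1[N], d2[N];
    body_branches(a1, b, c, d1, N);
    body_guards(a2, b, c, d2, N);
    for (int i = 0; i < N; i++)
        if (a1[i] != a2[i] || d1[i] != d2[i])
            return 0;
    return 1;
}
```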



Strip-Mining 

A strip-mining transformation may enable adjustment of the granularity
of an operation. It is commonly used to choose the number of independent computations in
the inner loop nest. When the iteration count is not known at compile time, it can be used to
generate a fixed iteration count inner loop satisfying the resource constraints. It can be used
in conjunction with other transformations like loop distribution or loop interchange. It is also
called loop sectioning. Cycle shrinking, also called stripping, is a specialization of
strip-mining.

An example of strip-mining is:

for (i=0; i<N; i++)                     up = (N/16)*16;
    a[i] = b[i] + c;                    for (i=0; i<up; i=i+16)
                                            a[i:i+16] = b[i:i+16] + c;
                                        for (j=i+1; j<N; j++)
                                            a[i] = b[i] + c;

Figure 28: Example of strip-mining
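
A runnable C sketch of strip-mining, under the assumption that an inner loop of fixed length stands in for one vector operation on a strip. In this sketch the remainder loop starts at `up` and uses its own index; the names `STRIP`, `add_const_strips` and `strip_mining_matches` are hypothetical.

```c
#include <assert.h>

enum { STRIP = 16 };   /* strip length; 16 mirrors the example above */

/* Strip-mined loop: full strips of STRIP elements, then a remainder
 * loop for the last n % STRIP elements. The inner loop over s stands
 * in for one vector operation on a strip of a and b. */
void add_const_strips(int *a, const int *b, int n, int c)
{
    assert(n >= 0);
    int up = (n / STRIP) * STRIP;
    for (int i = 0; i < up; i += STRIP)
        for (int s = 0; s < STRIP; s++)
            a[i + s] = b[i + s] + c;
    for (int j = up; j < n; j++)        /* remainder iterations */
        a[j] = b[j] + c;
}

/* Returns 1 if the strip-mined loop matches the plain loop. */
int strip_mining_matches(int n)
{
    enum { MAX = 100 };
    int a[MAX], b[MAX];
    assert(n >= 0 && n <= MAX);
    for (int i = 0; i < n; i++)
        b[i] = 3 * i;
    add_const_strips(a, b, n, 5);
    for (int i = 0; i < n; i++)
        if (a[i] != b[i] + 5)
            return 0;
    return 1;
}
```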



Loop Tiling

A loop tiling transformation may modify the iteration space of a loop nest by
introducing loop levels to divide the iteration space in tiles. It is a multi-dimensional
generalization of strip-mining. It is generally used to improve memory reuse, but can also
improve processor, register, TLB, or page locality. It is also called loop blocking.

The size of the tiles of the iteration space may be chosen so that the data needed in each tile
fit in the cache memory, thus reducing the cache misses. In the case of coarse-grain
computers, the size of the tiles can also be chosen so that the number of parallel operations of
the loop body fits the number of processors of the computer.

An example of loop tiling is:

for (i=0; i<N; i++)                     for (ii=0; ii<N; ii=ii+16)
    for (j=0; j<N; j++)                     for (jj=0; jj<N; jj=jj+16)
        a[i][j] = b[j][i];                      for (i=ii; i<min(ii+15,N); i++)
                                                    for (j=jj; j<min(jj+15,N); j++)
                                                        a[i][j] = b[j][i];

Figure 29: Example of loop tiling
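
The following C sketch shows one way a tiled transposition could look. It is a sketch under stated assumptions: the tile edge `TILE`, the matrix size `N` and the helper `min_int` are hypothetical, and the inner bounds are written as `min(ii + TILE, N)` so that each tile covers exactly `TILE` rows and columns.

```c
#include <assert.h>

enum { N = 37, TILE = 16 };  /* N is deliberately not a multiple of TILE */

static int min_int(int x, int y) { return x < y ? x : y; }

/* Tiled transposition: the two outer loops walk over tiles, the two
 * inner loops walk over the elements of one tile. */
void transpose_tiled(int a[N][N], int b[N][N])
{
    for (int ii = 0; ii < N; ii += TILE)
        for (int jj = 0; jj < N; jj += TILE)
            for (int i = ii; i < min_int(ii + TILE, N); i++)
                for (int j = jj; j < min_int(jj + TILE, N); j++)
                    a[i][j] = b[j][i];
}

/* Returns 1 if every element of a was written with the transposed value. */
int tiling_matches(void)
{
    static int a[N][N], b[N][N];
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            a[i][j] = -1;
            b[i][j] = i * N + j;
        }
    transpose_tiled(a, b);
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            if (a[i][j] != b[j][i])
                return 0;
    return 1;
}
```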



Loop Interchange 

A loop interchange transformation may be applied to a loop nest to move inside or
outside (depending on the searched effect) the loop level containing data
dependencies. It can:

• enable vectorization by moving inside an independent loop and outside a dependent
loop,

• improve vectorization by moving inside the independent loop with the largest range,

• reduce the stride,

• increase the number of loop-invariant expressions in the inner-loop, or

• improve parallel performance by moving an independent loop outside of a loop nest
to increase the granularity of each iteration and reduce the number of barrier
synchronizations.

An example of a loop interchange is:

for (i=0; i<N; i++)                     for (j=0; j<N; j++)
    for (j=0; j<N; j++)                     for (i=0; i<N; i++)
        a[i] = a[i] + b[i][j];                  a[i] = a[i] + b[i][j];

Figure 30: Example of loop interchange

Loop Coalescing / Collapsing

A loop coalescing / collapsing transformation may combine a loop nest into a
single loop. It can improve the scheduling of the loop, and also reduces the loop overhead.
Collapsing is a simpler version of coalescing in which the number of dimensions of arrays is
reduced as well. Collapsing may reduce the overhead of nested loops and
multidimensional arrays. Collapsing can be applied to loop nests that iterate over
memory with a constant stride. Otherwise, loop coalescing may be a better
approach. It can be used to make vectorizing profitable by increasing the iteration range of
the innermost loop.

An example of loop coalescing is:

for (i=0; i<N; i++)                     for (k=0; k<N*M; k++) {
    for (j=0; j<M; j++)                     i = ((k-1)/M)*M + 1;
        a[i][j] = a[i][j] + c;              j = ((k-1)%M) + 1;
                                            a[i][j] = a[i][j] + c;
                                        }

Figure 31: Example of loop coalescing
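
For zero-based C arrays, the index recovery used by coalescing can be sketched with a division and a remainder. This is an assumption of the sketch (row-major layout, zero-based indices) rather than the formulation used in the example above; `ROWS`, `COLS` and the function names are hypothetical.

```c
#include <assert.h>

enum { ROWS = 5, COLS = 7 };

/* Original loop nest. */
void add_const_nested(int a[ROWS][COLS], int c)
{
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            a[i][j] = a[i][j] + c;
}

/* Coalesced loop: a single counter k, from which the row and the
 * column index are recovered in every iteration. */
void add_const_coalesced(int a[ROWS][COLS], int c)
{
    for (int k = 0; k < ROWS * COLS; k++) {
        int i = k / COLS;   /* row index */
        int j = k % COLS;   /* column index */
        a[i][j] = a[i][j] + c;
    }
}

/* Returns 1 if both forms update the array identically. */
int coalescing_matches(int c)
{
    int a1[ROWS][COLS], a2[ROWS][COLS];
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            a1[i][j] = a2[i][j] = i * COLS + j;
    add_const_nested(a1, c);
    add_const_coalesced(a2, c);
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            if (a1[i][j] != a2[i][j])
                return 0;
    return 1;
}
```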



Loop Fusion 

A loop fusion transformation, also called loop jamming, may merge two
successive loops. It may reduce loop overhead, increase instruction-level
parallelism, improve register, cache, TLB or page locality, and improve the load balance of
parallel loops. Alignment can be taken into account by introducing conditional instructions
to take care of dependencies.

An example of loop fusion is:

for (i=0; i<N; i++)                     for (i=0; i<N; i++) {
    a[i] = b[i] + c;                        a[i] = b[i] + c;
for (i=0; i<N; i++)                         d[i] = e[i] + c;
    d[i] = e[i] + c;                    }

Figure 32: Example of loop fusion

Loop Distribution

A loop distribution transformation, also called loop fission, may allow splitting a
loop into several pieces in case the loop body is too big, or because of
dependencies. The iteration space of the new loops may be the same as the
iteration space of the original loop. Loop spreading is a more sophisticated distribution.

An example of loop distribution is:

for (i=0; i<N; i++) {                   for (i=0; i<N; i++)
    a[i] = b[i] + c;                        a[i] = b[i] + c;
    d[i] = e[i] + c;                    for (i=0; i<N; i++)
}                                           d[i] = e[i] + c;

Figure 33: Example of loop distribution



Loop Unrolling / Unroll-and-Jam

A loop unrolling / unroll-and-jam transformation may replicate the original
loop body in order to get a larger one. A loop can be unrolled partially or completely. It
may be used to get more opportunity for parallelization by making the loop body bigger.
It may also improve register or cache usage and reduce loop overhead.
Unrolling the outer loop followed by merging the induced inner loops is referred to as
unroll-and-jam.

An example of loop unrolling is:

for (i=0; i<N; i++)                     for (i=0; i<N; i=i+2) {
    a[i] = b[i] + c;                        a[i] = b[i] + c;
                                            a[i+1] = b[i+1] + c;
                                        }
                                        if (((N-1)%2) == 1)
                                            a[N-1] = b[N-1] + c;

Figure 34: Example of loop unrolling
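
A minimal C sketch of unrolling by a factor of two. It assumes that the unrolled loop should stop while a full pair of elements is still available and that a single left-over iteration handles odd trip counts; the function names are hypothetical.

```c
#include <assert.h>

/* Loop unrolled by two. The main loop only runs while both i and
 * i + 1 are valid indices; an odd n leaves one iteration over. */
void add_const_unrolled(int *a, const int *b, int n, int c)
{
    assert(n >= 0);
    int i;
    for (i = 0; i + 1 < n; i += 2) {
        a[i] = b[i] + c;
        a[i + 1] = b[i + 1] + c;
    }
    if (i < n)                  /* odd n: one iteration is left over */
        a[i] = b[i] + c;
}

/* Returns 1 if the unrolled loop matches the plain loop and does not
 * write past the n-th element. */
int unrolling_matches(int n)
{
    enum { MAX = 33 };
    int a[MAX + 1], b[MAX + 1];
    assert(n >= 0 && n <= MAX);
    for (int i = 0; i <= MAX; i++) {
        a[i] = -1;
        b[i] = 10 * i;
    }
    add_const_unrolled(a, b, n, 4);
    for (int i = 0; i < n; i++)
        if (a[i] != b[i] + 4)
            return 0;
    return a[n] == -1;          /* element n must be untouched */
}
```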



Loop Alignment 

A loop alignment optimization may transform the code to get aligned array
accesses in the loop body. Its effect may be to transform loop-carried
dependencies into loop-independent dependencies, which allows
for extraction of more parallelism from a loop. It can use different transformations,
like loop peeling or introduce conditional statements, to achieve its goal. This transformation
can be used in conjunction with loop fusion to enable this optimization by aligning the array
accesses in both loop nests. In the example below, all accesses to array a become aligned.

An example of loop alignment is:

for (i=2; i<=N; i++) {                  for (i=1; i<=N; i++) {
    a[i] = b[i] + c[i];                     if (i>1) a[i] = b[i] + c[i];
    d[i] = a[i-1] * 2;                      if (i<N) d[i+1] = a[i] * 2;
    e[i] = a[i-1] + d[i+1];                 if (i<N) e[i+1] = a[i] + d[i+2];
}                                       }

Figure 35: Example of loop alignment

Loop Skewing

A loop skewing transformation may be used to enable parallelization of a loop nest. It
may be useful in combination with loop interchange. It may be performed by adding the
outer loop index multiplied by a skew factor, f, to the bounds of the inner loop variable, and
then subtracting the same quantity from every use of the inner loop variable inside the loop.

An example of loop skewing is: 

for (i=1; i<=N; i++) {                  for (i=1; i<=N; i++) {
    for (j=1; j<=N; j++)                    for (j=i+1; j<=i+N; j++)
        a[i] = a[i+j] + c;                      a[i] = a[j] + c;
}                                       }

Figure 36: Example of loop skewing
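
The bound shift and the compensating subtraction can be sketched in C with an explicit skew factor. The parameter `f`, the sizes `N` and `LEN`, and the function names are assumptions of this sketch; with `f` equal to 1 the skewed loop corresponds to the example above.

```c
#include <assert.h>

enum { N = 6, LEN = 2 * N + 1 };

/* Original loop nest (one-based, as in the example above). */
void update_original(int *a, int c)
{
    for (int i = 1; i <= N; i++)
        for (int j = 1; j <= N; j++)
            a[i] = a[i + j] + c;
}

/* Skewed loop nest: f*i is added to both inner bounds and
 * subtracted again from every use of j in the body. */
void update_skewed(int *a, int c, int f)
{
    for (int i = 1; i <= N; i++)
        for (int j = 1 + f * i; j <= N + f * i; j++)
            a[i] = a[i + (j - f * i)] + c;
}

/* Returns 1 if the skewed nest computes the same array. */
int skewing_matches(int f)
{
    int a1[LEN], a2[LEN];
    assert(f >= 0);
    for (int i = 0; i < LEN; i++)
        a1[i] = a2[i] = i * i + 1;
    update_original(a1, 3);
    update_skewed(a2, 3, f);
    for (int i = 0; i < LEN; i++)
        if (a1[i] != a2[i])
            return 0;
    return 1;
}
```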

Loop Peeling

A loop peeling transformation may remove a small number of beginning or
ending iterations of a loop to avoid dependences in the loop body. These removed iterations
may be executed separately. It can be used for matching the iteration control of adjacent
loops to enable loop fusion.

An example of loop peeling is:

for (i=0; i<=N; i++)                    a[0][N] = a[0][N] + a[N][N];
    a[i][N] = a[0][N] + a[N][N];        for (i=1; i<=N-1; i++)
                                            a[i][N] = a[0][N] + a[N][N];
                                        a[N][N] = a[0][N] + a[N][N];

Figure 37: Example of loop peeling

Loop Splitting

A loop splitting transformation may cut the iteration space in pieces by creating other
loop nests. It is also called Index Set Splitting, and is generally used because of
dependencies that prevent parallelization. The iteration space of the new loops
may be a subset of the original one. It can be seen as a generalization of loop peeling.

An example of loop splitting is:

for (i=0; i<=N; i++)                    for (i=0; i<(N+1)/2; i++)
    a[i] = a[N-i+1] + c;                    a[i] = a[N-i+1] + c;
                                        for (i=(N+1)/2; i<=N; i++)
                                            a[i] = a[N-i+1] + c;

Figure 38: Example of loop splitting

Node Splitting

A node splitting transformation may split a statement in pieces. It may be used to
break dependence cycles in the dependence graph due to the too high granularity of the
nodes, thus enabling vectorization of the statements.

An example of node splitting is:

for (i=0; i<N; i++) {                   for (i=0; i<N; i++) {
    b[i] = a[i] + c[i] * d[i];              t1[i] = c[i] * d[i];
    a[i+1] = b[i] * (d[i] - c[i]);          t2[i] = d[i] - c[i];
}                                           b[i] = a[i] + t1[i];
                                            a[i+1] = b[i] * t2[i];
                                        }

Figure 39: Example of node splitting



Scalar Expansion 

A scalar expansion transformation may replace a scalar in a loop by an array to
eliminate dependencies in the loop body and enable parallelization of the loop
nest. If the scalar is used after the loop, a compensation code must be added.

An example of scalar expansion is:

for (i=0; i<N; i++) {                   for (i=0; i<N; i++) {
    c = b[i];                               tmp[i] = b[i];
    a[i] = a[i] + c;                        a[i] = a[i] + tmp[i];
}                                       }
                                        c = tmp[N-1];

Figure 40: Example of scalar expansion

Array Contraction / Array Shrinking

An array contraction / array shrinking transformation is the reverse transformation of
scalar expansion. It may be needed if scalar expansion generates too many memory
requirements.

An example of array contraction is:

for (i=0; i<N; i++)                     for (i=0; i<N; i++)
    for (j=0; j<N; j++) {                   for (j=0; j<N; j++) {
        t[i][j] = a[i][j] * 3;                  t[j] = a[i][j] * 3;
        b[i][j] = t[i][j] + c[j];               b[i][j] = t[j] + c[j];
    }                                       }

Figure 41: Example of array contraction

Scalar Replacement

A scalar replacement transformation may replace an invariant array reference in
a loop by a scalar. This array element may be loaded in a scalar before the inner loop and
stored again after the inner loop if it is modified. It can be used in conjunction with loop
interchange.

An example of scalar replacement is:

for (i=0; i<N; i++)                     for (i=0; i<N; i++) {
    for (j=0; j<N; j++)                     tmp = a[i];
        a[i] = a[i] + b[i][j];              for (j=0; j<N; j++)
                                                tmp = tmp + b[i][j];
                                            a[i] = tmp;
                                        }

Figure 42: Example of scalar replacement

Reduction Recognition

A reduction recognition transformation may allow handling of
reductions in loops. A reduction may be an operation that computes a scalar value from
arrays. It can be a dot product, the sum or minimum of a vector for instance. A goal is
then to perform as many operations in parallel as possible. One way may be to accumulate
a vector register of partial results and then reduce it to a scalar with a sequential loop.
Maximum parallelism may then be achieved by reducing the vector register with a tree,
i.e., pairs of elements are summed, then pairs of these results are summed, etc.

An example of reduction recognition is:

for (i=0; i<N; i++)                     for (i=0; i<N; i=i+64)
    s = s + a[i];                           tmp[0:63] = tmp[0:63] + a[i:i+63];
                                        for (i=0; i<64; i++)
                                            s = s + tmp[i];

Figure 43: Example of reduction recognition
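
The partial-sum scheme and the tree reduction described above can be sketched in C as follows. This is an illustrative sketch only: the lane count `LANES` (assumed to be a power of two), the left-over handling and the function names are assumptions, and the inner loop over lanes stands in for one vector addition.

```c
#include <assert.h>

enum { LANES = 8 };   /* assumed vector width, must be a power of two */

/* Reference: plain sequential sum. */
long sum_sequential(const int *a, int n)
{
    long s = 0;
    for (int i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Sum with LANES independent partial results, reduced with a tree. */
long sum_partial(const int *a, int n)
{
    long tmp[LANES] = {0};
    assert(n >= 0);
    int up = (n / LANES) * LANES;
    /* accumulate one partial sum per lane (vectorizable) */
    for (int i = 0; i < up; i += LANES)
        for (int l = 0; l < LANES; l++)
            tmp[l] += a[i + l];
    /* tree reduction: pairs are summed, then pairs of the results */
    for (int width = LANES / 2; width > 0; width /= 2)
        for (int l = 0; l < width; l++)
            tmp[l] += tmp[l + width];
    long s = tmp[0];
    for (int i = up; i < n; i++)    /* left-over elements */
        s += a[i];
    return s;
}

/* Returns 1 if both sums agree on the values 1..n. */
int reduction_matches(int n)
{
    enum { MAX = 200 };
    int v[MAX];
    assert(n >= 0 && n <= MAX);
    for (int i = 0; i < n; i++)
        v[i] = i + 1;
    return sum_partial(v, n) == sum_sequential(v, n)
        && sum_partial(v, n) == (long)n * (n + 1) / 2;
}
```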

Loop Pushing / Loop Embedding

A loop pushing / loop embedding transformation may replace a call in a loop
body by the loop in the called function. It may be an interprocedural
optimization. It may allow the parallelization of the loop nest and eliminate
the overhead caused by the procedure call. Loop distribution can be used in conjunction with
loop pushing.

An example of loop pushing is:

for (i=0; i<N; i++)                     f2(x);
    f(x,i);
                                        void f2(int* a) {
void f(int* a, int j) {                     for (i=0; i<N; i++)
    a[j] = a[j] + c;                            a[i] = a[i] + c;
}                                       }

Figure 44: Example of loop pushing



Procedure Inlining 

A procedure inlining transformation replaces a call to a procedure by the code of the
procedure itself. It is an interprocedural optimization. It allows a loop nest
to be parallelized, removes overhead caused by the procedure call, and can improve locality.

An example of procedure inlining is:

for (i=0; i<N; i++)                     for (i=0; i<N; i++)
    f(a,i);                                 a[i] = a[i] + c;

void f(int* x, int j) {
    x[j] = x[j] + c;
}

Figure 45: Example of procedure inlining



Statement Reordering 

A statement reordering transformation schedules instructions of the loop body to modify
the data dependence graph and enable vectorization.

An example of statement reordering is:

for (i=0; i<N; i++) {                   for (i=0; i<N; i++) {
    a[i] = b[i] * 2;                        c[i] = a[i-1] - 4;
    c[i] = a[i-1] - 4;                      a[i] = b[i] * 2;
}                                       }

Figure 46: Example of statement reordering

Software Pipelining

A software pipelining transformation may parallelize a loop body by
scheduling instructions of different instances of the loop body. It may be a powerful
optimization to improve instruction-level parallelism. It can be used in conjunction with loop
unrolling. In the example below, the preload commands can be issued one after another,
each taking only one cycle. This time is just enough to request the memory areas. It is not
enough to actually load them. This takes many cycles, depending on the cache level that
actually has the data. Execution of a configuration behaves similarly. The configuration is
issued in a single cycle, waiting until all data are present. Then the configuration executes for
many cycles. Software pipelining overlaps the execution of a configuration with the preloads
for the next configuration. This way, the XPP array can be kept busy in parallel to the
Load/Store unit.

An example of software pipelining is:

Issue Cycle   Command
              XPPPreloadConfig(CFG1);
              for (i=0; i<100; ++i) {
1:                XPPPreload(2, a+10*i, 10);
2:                XPPPreload(5, b+20*i, 20);
3:                // delay
4:
5:
6:                XPPExecute(CFG1);
              }

Issue Cycle   Command
Prologue      XPPPreloadConfig(CFG1);
              XPPPreload(2, a, 10);
              XPPPreload(5, b, 20);
              // delay
              for (i=1; i<100; ++i) {
Kernel  1:        XPPExecute(CFG1);
        2:        XPPPreload(2, a+10*i, 10);
        3:        XPPPreload(5, b+20*i, 20);
        4:    }
Epilogue      XPPExecute(CFG1);
              // delay

Figure 47: Example of software pipelining
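
The ordering of commands in the pipelined schedule can be checked with a small C model. This is only a sketch under stated assumptions: the functions `preload_a`, `preload_b` and `execute_cfg` are hypothetical stand-ins that merely record that a command was issued, they are not the XPP interface, and `ITERATIONS` is kept small for readability.

```c
#include <assert.h>

enum { ITERATIONS = 4, MAX_EVENTS = 3 * ITERATIONS };
enum event { EV_PRELOAD_A, EV_PRELOAD_B, EV_EXECUTE };

static enum event events[MAX_EVENTS];
static int event_count;

static void record(enum event e)
{
    assert(event_count < MAX_EVENTS);
    events[event_count++] = e;
}

/* Hypothetical stand-ins for the preload and execute commands. */
static void preload_a(int i) { (void)i; record(EV_PRELOAD_A); }
static void preload_b(int i) { (void)i; record(EV_PRELOAD_B); }
static void execute_cfg(void) { record(EV_EXECUTE); }

/* Pipelined schedule: prologue, kernel, epilogue. */
static void run_pipelined(void)
{
    event_count = 0;
    preload_a(0);                       /* prologue: data for iteration 0 */
    preload_b(0);
    for (int i = 1; i < ITERATIONS; i++) {
        execute_cfg();                  /* kernel: uses data of iteration i-1 */
        preload_a(i);                   /* requests data for iteration i */
        preload_b(i);
    }
    execute_cfg();                      /* epilogue: last iteration */
}

/* Returns 1 if each execute was issued only after both of its
 * preloads, and every iteration was preloaded and executed once. */
int pipelined_schedule_ok(void)
{
    int pre_a = 0, pre_b = 0, exec = 0;
    run_pipelined();
    for (int k = 0; k < event_count; k++) {
        switch (events[k]) {
        case EV_PRELOAD_A: pre_a++; break;
        case EV_PRELOAD_B: pre_b++; break;
        case EV_EXECUTE:
            if (pre_a <= exec || pre_b <= exec)
                return 0;
            exec++;
            break;
        }
    }
    return pre_a == ITERATIONS && pre_b == ITERATIONS && exec == ITERATIONS;
}
```

The model only captures issue order; it does not model cycle counts or the latency of the preloads.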



Vector Statement Generation 

A vector statement generation transformation may replace instructions by vector
instructions that can perform an operation on several data in parallel.

An example of vector statement generation is:

for (i=0; i<N; i++)                     a[0:N] = b[0:N];
    a[i] = b[i];

Figure 48: Example of vector statement generation

Data-Layout Optimizations

Optimizations may modify the data layout in
memory in order to extract more parallelism or prevent memory problems like cache misses.
Examples of such optimizations are scalar privatization, array privatization, and array
merging.
Scalar Privatization 

A scalar privatization optimization may be used in multi-processor systems to increase
the amount of parallelism and avoid unnecessary communications between the processing
elements. If a scalar is only used like a temporary variable in a loop body, then each
processing element can receive a copy of it and achieve its computations with this private
copy.

An example of scalar privatization is:

for (i=0; i<=N; i++) {
    c = b[i];
    a[i] = a[i] + c;
}

Figure 49: Example for scalar privatization



Array Privatization 

An array privatization optimization may be the same as scalar privatization except that
it may work on arrays rather than on scalars.

Array Merging

An array merging optimization may transform the data layout of arrays by
merging the data of several arrays following the way they are accessed in a loop nest. This
way, memory cache misses can be avoided. The layout of the arrays can be different for each
loop nest. The example code for array merging presented below is an example of
a cross-filter, where the accesses to array 'a' are interleaved with accesses to array
'b'. Fig. 23 illustrates a data layout of both
arrays, where blocks of 'a' (the dark highlighted portions) are merged with blocks
of 'b' (the lighter highlighted portions). Unused memory space is represented
by the white portions. Thus, cache misses may be avoided as data blocks containing
arrays 'a' and 'b' are loaded into the cache when getting data from memory. More details
can be found in Daniela Genius & Sylvain Lelait, "A Case for Array Merging in Memory
Hierarchies," Proceedings of the International Workshop on Compilers for Parallel
Computers, CPC'01 (June 2001).

for (i=1; i<=N-1; i++)
    for (j=1; j<=N; j++)
        b[i][j] = 0.25*(a[i-1][j]+a[i][j-1]+a[i+1][j]+a[i][j+1]);

Figure 50: Example for array merging
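
One simple way to picture a merged layout in C is an array of structures, so that corresponding elements of 'a' and 'b' are neighbours in memory. This is an assumption made for illustration: the text above merges blocks of the two arrays, whereas the sketch merges element by element, and the names `struct cell`, `cross_filter_merged` and `merged_layout_matches` are hypothetical.

```c
#include <assert.h>

enum { N = 6 };

/* Separate arrays: a and b live in different memory regions. */
void cross_filter_separate(double a[N + 2][N + 2], double b[N + 2][N + 2])
{
    for (int i = 1; i <= N; i++)
        for (int j = 1; j <= N; j++)
            b[i][j] = 0.25 * (a[i-1][j] + a[i][j-1] + a[i+1][j] + a[i][j+1]);
}

/* Merged layout: element (i,j) of a and of b share one cell. */
struct cell { double a; double b; };

void cross_filter_merged(struct cell m[N + 2][N + 2])
{
    for (int i = 1; i <= N; i++)
        for (int j = 1; j <= N; j++)
            m[i][j].b = 0.25 * (m[i-1][j].a + m[i][j-1].a
                              + m[i+1][j].a + m[i][j+1].a);
}

/* Returns 1 if both layouts produce the same filter output. */
int merged_layout_matches(void)
{
    static double a[N + 2][N + 2], b[N + 2][N + 2];
    static struct cell m[N + 2][N + 2];
    for (int i = 0; i < N + 2; i++)
        for (int j = 0; j < N + 2; j++)
            a[i][j] = m[i][j].a = (double)(i * (N + 2) + j);
    cross_filter_separate(a, b);
    cross_filter_merged(m);
    for (int i = 1; i <= N; i++)
        for (int j = 1; j <= N; j++)
            if (b[i][j] != m[i][j].b)
                return 0;
    return 1;
}
```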

Example of application of the optimizations

In accordance with that which is discussed above, it will be appreciated that a
lot of optimizations can be performed on loops before and also after generation of vector
statements. Finding a sequence of optimizations that would produce an optimal solution for
all loop nests of a program is still an area of research. Therefore, in an
embodiment of the present invention, a way to use these optimizations is provided that
follows a reasonable heuristic to produce vectorizable loop nests. To vectorize the code,
the Allen-Kennedy algorithm, which uses statement reordering and loop distribution
before vector statements are generated, can be used. It can be enhanced with loop
interchange, scalar expansion, index set splitting, node splitting, and loop peeling. All these
transformations are based on the data dependence graph. A statement can be vectorized if it
is not part of a dependence cycle. Hence, optimizations may be performed to break
cycles or, if not completely possible, to create loop nests without dependence cycles.

The whole process may be divided into four major steps. First,
the procedures may be restructured by analyzing the procedure calls inside
the loop bodies. Removal of the procedures may then be tried.
Then, some high-level dataflow optimizations may be applied to the loop bodies to modify
their control-flow and simplify their code. The third step may include
preparing the loop nests for vectorization by building perfect loop nests and ensuring that
inner loop levels are vectorizable. Then, optimizations can be performed that target the
architecture and optimize the data locality. It should also be noted that other optimizations
and code transformations can occur between these different steps that can also help to further
optimize the loop nests.

Hence, the first step may apply procedure inlining and loop pushing to remove the
procedure calls of the loop bodies. Then, the second step may include
loop-invariant code motion, loop unswitching, strength reduction and idiom recognition. The third
step can be divided in several subsets of optimizations. Loop
reversal, loop normalization and if-conversion may be initially applied to get normalized
loop nests. This may allow building of the data dependency graph. Then, if
dependencies prevent the loop nest from being vectorized, transformations may be
applied. For instance, if dependencies occur only on certain iterations, loop
peeling or loop splitting may be applied. Node splitting, loop skewing, scalar expansion
or statement reordering can be applied in other cases. Then, loop interchange may
move inwards the loop levels without dependence cycles. A goal is to have perfectly
nested loops with the loop levels carrying dependence cycles as much outwards as possible.
Then, loop fusion, reduction recognition, scalar replacement / array
contraction, and loop distribution may be applied to further improve the following
vectorization. Vector statement generation can be performed at last using the Allen-Kennedy
algorithm for instance. The last step can include optimizations such as loop
tiling, strip-mining, loop unrolling and software pipelining that take into account the target
processor.

The number of optimizations in the third step may be large, but it may be that not all of
them are applied to each loop nest. Following the goal of the vectorization and the data
dependence graph, only some of them are applied. Heuristics may be used to guide the
application of the optimizations, which can be applied several times if needed.
The following code is an example of this:



void f(int** a, int** b, int *c, int i, int j) {
    a[i][j] = a[i][j-1] - b[i+1][j-1];
}

void g(int* a, int* c, int i) {
    a[i] = c[i] + 2;
}

for (i=0; i<N; i++) {
    for (j=1; j<9; j++) {
        if (k>0)
            f(a, b, i, j);
        else
            g(d, c, j);
    }
    d[i] = d[i+1] + 2;
}

for (i=0; i<N; i++)
    a[i][i] = b[i] + 3;

The first step will find that inlining the two procedure calls is possible. Then loop
unswitching can be applied to remove the conditional instruction of the loop body. The
second step may begin by applying loop normalization and analyzing the data
dependence graph. A cycle can be broken by applying loop interchange as it is only carried
by the second level. The two levels may be exchanged, so that the inner level is
vectorizable. Before that or also after, loop distribution may be applied. Loop
fusion can be applied when the loop on i is pulled out of the conditional instruction by a
traditional redundant code elimination optimization. Finally, vector code can be generated
for the resulting loops.

In more detail, after procedure inlining, the following may be obtained:

for (i=0; i<N; i++) {
    for (j=1; j<9; j++) {
        if (k>0)
            a[i][j] = a[i][j-1] - b[i+1][j-1];
        else
            d[j] = c[j] + 2;
    }
    d[i] = d[i+1] + 2;
}

for (i=0; i<N; i++)
    a[i][i] = b[i] + 3;

After loop unswitching, the following may be obtained:

if (k > 0)
    for (i=0; i<N; i++) {
        for (j=1; j<9; j++)
            a[i][j] = a[i][j-1] - b[i+1][j-1];
        d[i] = d[i+1] + 2;
    }
else
    for (i=0; i<N; i++) {
        for (j=1; j<9; j++)
            d[j] = c[j] + 2;
        d[i] = d[i+1] + 2;
    }

for (i=0; i<N; i++)
    a[i][i] = b[i] + 3;

After loop normalization, the following may be obtained:

if (k > 0)
    for (i=0; i<N; i++) {
        for (j=0; j<8; j++)
            a[i][j+1] = a[i][j] - b[i+1][j];
        d[i] = d[i+1] + 2;
    }
else
    for (i=0; i<N; i++) {
        for (j=0; j<8; j++)
            d[j] = c[j+1] + 2;
        d[i] = d[i+1] + 2;
    }

for (i=0; i<N; i++)
    a[i][i] = b[i] + 3;

After loop distribution and loop fusion, the following may be obtained:

if (k > 0)
    for (i=0; i<N; i++)
        for (j=0; j<8; j++)
            a[i][j+1] = a[i][j] - b[i+1][j];
else
    for (i=0; i<N; i++)
        for (j=0; j<8; j++)
            d[j] = c[j+1] + 2;

for (i=0; i<N; i++) {
    d[i] = d[i+1] + 2;
    a[i][i] = b[i] + 3;
}

After loop interchange, the following may be obtained:

if (k > 0)
    for (j=0; j<8; j++)
        for (i=0; i<N; i++)
            a[i][j+1] = a[i][j] - b[i+1][j];
else
    for (i=0; i<N; i++)
        for (j=0; j<8; j++)
            d[j] = c[j+1] + 2;

for (i=0; i<N; i++) {
    d[i] = d[i+1] + 2;
    a[i][i] = b[i] + 3;
}

After vector code generation, the following may be obtained:

if (k > 0)
    for (j=0; j<8; j++)
        a[0:N-1][j+1] = a[0:N-1][j] - b[0:N][j];
else
    for (i=0; i<N; i++)
        d[0:8] = c[1:9] + 2;

d[0:N-1] = d[1:N] + 2;
a[0:N-1][0:N-1] = b[0:N] + 3;

4 Compiler Specification for the PACT XPP 

4.1 Introduction

A cached RISC-XPP architecture may exploit its full potential on code that is
characterized by high data locality and high computational effort. A compiler for this
architecture has to consider these design constraints. The compiler's primary objective is to
concentrate computationally expensive calculations to innermost loops and to make up as much
data locality as possible for them.

The compiler may contain usual analysis and optimizations. As interprocedural
analyses, e.g., alias analysis, are especially useful, a global optimization driver may be
necessary to ensure the propagation of global information to all optimizations. The way the
PACT XPP may influence the compiler is discussed in the following sections.



4.2 Compiler Structure

Fig. 24 provides a global view of the compiling procedure and shows main steps
the compiler may follow to produce code for a system containing a RISC processor and
a PACT XPP. The next sections focus on the XPP compiler itself, but first the other steps are
briefly described.



Figure 51: Global View of the Compiling Process 

4.2.1 Code Preparation

Code preparation may take the whole program as input and can be considered
as a usual compiler front-end. It may prepare the code by applying code analysis and
optimizations to enable the compiler to extract as many loop nests as possible to be executed
by the PACT XPP. Important optimizations are idiom recognition, copy propagation, dead
code elimination, and all usual analysis like dataflow and alias analysis.

4^ — Partitioning 

Partitioning deeide smav decide which part of the program is executed by the host processor 
and which part is executed by the PACT XPP. 

A loop nest i smav be executed by the host in three cases: 

• • — if the loop nest is not well-formed, 

• M if the number of operations to execute is not worth it to b e being executed on the 

PACT XPP, or 

• - — if it is impossible to get a mapping of the loop nest on the PACT XPP, 

A loop nest is said to be well-formed if the loop bounds and the step of all loops are constant, 
the loop induction variables are known and if there is only one entry and one exit to the loop 
nest. 

Another problem may arise with loop nests where the loop bounds are constant but
unknown at compile time. Loop tiling may allow for overcoming this
problem, as will be described below. Nevertheless, it could be that it is not worth
executing the loop nest on the PACT XPP if the loop bounds are too low. A
conditional instruction testing if the loop bounds are large enough can be introduced, and
two versions of the loop nest may be produced. One would be executed on the host
processor, and the other on the PACT XPP when the loop bounds are suitable. This would
also ease applications of loop transformations, as possible compensation code would be
simpler due to the hypothesis on the loop bounds.
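
The two-version scheme described above can be sketched in C as follows. This is only an illustration under stated assumptions: the threshold MIN_TRIP_COUNT and the function names are hypothetical, and the "array" version is an ordinary C loop standing in for the code that would be mapped onto the PACT XPP.

```c
#include <stddef.h>

/* Hypothetical threshold below which offloading is assumed not to pay off. */
#define MIN_TRIP_COUNT 64

/* Stand-in for the version of the loop that would be mapped onto the array. */
static void scale_on_array(int *a, size_t n, int k) {
    for (size_t i = 0; i < n; i++)
        a[i] *= k;
}

/* Plain host version of the same loop. */
static void scale_on_host(int *a, size_t n, int k) {
    for (size_t i = 0; i < n; i++)
        a[i] *= k;
}

/* Conditional test on the loop bound selects one of the two versions.
 * Returns 1 if the array version was chosen, 0 for the host version. */
int scale(int *a, size_t n, int k) {
    if (n >= MIN_TRIP_COUNT) {
        scale_on_array(a, n, k);
        return 1;
    }
    scale_on_host(a, n, k);
    return 0;
}
```

Because the array version is only entered when the bound test succeeds, any transformation applied to it may rely on that lower limit for the trip count.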



RISC Code Generation and Scheduling

After the XPP compiler has produced NML code for the loops chosen by the partitioning
phase, the main compiling process may handle the code that will be executed by the host
processor where instructions to manage the configurations have been inserted. This is an
aim of the last two steps:

• RISC Code Generation and

• RISC Code Scheduling.

The first one may produce code for the host processor and the second one
may optimize it further by looking for a better scheduling using software pipelining,
for instance.

XPP Compiler for Loops

Fig. 25 illustrates a detailed architecture and an internal processing of
the XPP Compiler. It is a complex cooperation between program transformations, included
in the XPP Loop optimizations, a temporal partitioning phase, NML code generation and the
mapping of the configuration on the PACT XPP.

Figure 52: Detailed Architecture of the XPP Compiler 

First, loop optimizations targeted at the PACT XPP may be applied to try to produce
innermost loop bodies that can be executed on the array of processors. If this is the case, the
NML code generation phase may be called. If not, then temporal partitioning may be
applied to get several configurations for the same loop. After NML code generation and the
mapping phase, it can also happen that a configuration will not fit on the PACT XPP. In this
case, the loop optimizations may be applied again with respect to the reasons of failure of
the NML code generation or of the mapping. If this new application of loop optimizations
does not change the code, temporal partitioning may be applied. Furthermore, the number
of attempts for the NML code generation and the mapping may be kept track of. If too many
attempts are made and a solution is still not obtained, the process may be broken and the
loop nest may be executed by the host processor.

Temporal Partitioning

Temporal partitioning may split the code generated for the PACT XPP into several
configurations if the number of operations, i.e., the size of the configuration, to be executed
in a loop nest exceeds the number of operations executable in a single configuration. This
transformation is called loop dissevering. See, for example,
Joao M.P. Cardoso & Markus Weinhardt, "XPP-VC: A C Compiler with Temporal
Partitioning for the PACT-XPP Architecture," Proceedings of the 12th International
Conference on Field-Programmable Logic and Applications, FPL'2002, 2438 LNCS, 864-874
(2002). These configurations may then be integrated in a loop of configurations whose
number of executions corresponds to the iteration range of the original loop.

Generation of NML Code

Generation of NML code may take as input an intermediate form of the code
produced by the XPP Loop optimizations step, together with a dataflow graph built upon it.
NML code can then be produced by using tree- or DAG-pattern matching techniques.

Mapping Step

A mapping step may take care of mapping the NML modules on the PACT XPP by
placing the operations on the ALUs, FREGs, and BREGs, and routing the data through the
buses.

XPP Loop Optimizations Driver

A goal of loop optimizations used for the PACT XPP is to
extract as much parallelism as possible from the loop nests in order to execute them on the
PACT XPP by exploiting the ALU-PAEs as effectively as possible and to avoid memory
bottlenecks with the IRAMs. The following sections explain how they may be organized
and how to take into account the architecture for applying the optimizations.

Organization of the System

Fig. 26 provides a detailed view of the XPP loop
optimizations, including their organization. The transformations may be divided in six
groups. Other standard optimizations and analyses may be applied in-between. Each
group could be called several times. Loops over several groups can also occur if needed.
The number of iterations for each driver loop can be of constant value or determined at
compile time by the optimizations themselves (e.g., repeat until a certain code quality is
reached). In the first iteration of the loop, it can be checked if loop nests are usable for the
PACT XPP. It is mainly directed to check the loop bounds, etc. For instance, if the loop
nest is well-formed and the data dependence graph does not prevent optimization, but the
loop bounds are unknown, then, in the first iteration, loop tiling may be applied to get an
innermost loop that is easier to handle and can be better optimized, and in the second iteration,
loop normalization, if-conversion, loop interchange and other optimizations can be applied to
effectively optimize the innermost loops for the PACT XPP. Nevertheless, this has not been
necessary until now with the examples presented below.

With reference to Fig. 26, Group I may ensure that no procedure calls occur in the
loop nest. Group II may prepare the loop bodies by removing loop-invariant
instructions and conditional instructions to ease the analysis. Group III may generate
loop nests suitable for the data dependence analysis. Group IV may contain
optimizations to transform the loop nests to get data dependence graphs that are suitable for
vectorization. Group V may contain optimizations that ensure that the innermost
loops can be executed on the PACT XPP. Group VI may contain optimizations that
further extract parallelism from the loop bodies. Group VII may contain
optimizations more towards optimizing the usage of the hardware itself.

In each group, the application of the optimizations may depend on the result of the
analysis and the characteristics of the loop nest. For instance, it is clear that not all
transformations in Group IV are applied. It depends on the data dependence graph computed
before.

Figure 53: Detailed View of the XPP Loop Optimizations 

Loop Preparation

The optimizations of Groups I, II and III of the XPP compiler may generate loop bodies
without procedure calls, conditional instructions and induction variables other than loop
control variables. Thus, loop nests, where the innermost loops are suitable for execution on
the PACT XPP, may be obtained. The iteration ranges may be normalized to ease
dependence analysis and the application of other code transformations.



Transformation of the Data Dependence Graph

The optimizations of Group IV may be performed to obtain innermost loops suitable for
vectorization with respect to the data dependence graph. Nevertheless, a difference with
usual vectorization is that a dependence cycle, which would normally prevent any
vectorization of the code, does not prevent the optimization of a loop nest for the PACT XPP.
If a cycle is due to an anti-dependence, then it could be that it will not prevent
optimization of the code as stated in Markus Weinhardt & Wayne Luk,
"Pipeline Vectorization," IEEE Transactions on Computer-Aided Design of Integrated
Circuits and Systems, 20(2):234-248 (February 2001). Furthermore, dependence cycles will
not prevent vectorization for the PACT XPP when they consist only of a loop-carried true
dependence on the same expression. If cycles with distance k occur in the data dependence
graph, then this can be handled by holding k values in registers. This optimization is of the
same class as cycle shrinking.

Nevertheless, limitations due to the dependence graph exist. Loop nests cannot be handled if
some dependence distances are not constant or unknown. If only a few
dependencies prevent the optimization of the whole loop nest, this could be
overcome by using the traditional vectorization algorithm that sorts topologically the
strongly connected components of the data dependence graph (statement reordering), and
then applying loop distribution. This way, loop nests, which can be handled by the
PACT XPP and some by the host processor, can be obtained.



Influence of the Architectural Parameters

Some hardware specific parameters may influence the application of the loop
transformations. The number of operations and memory accesses that a loop body performs
may be estimated at each step. These parameters may influence loop unrolling,
strip-mining, loop tiling and also loop interchange (iteration range).

The table below lists the parameters that may influence the application of the optimizations.
For each of them, two data are given: a starting value computed from the loop and a
restriction value which is the value the parameter should reach or should not exceed after the
application of the optimizations. Vector length depicts the range of the innermost loops, i.e.,
the number of elements of an array accessed in the loop body. Reused data set size represents
the amount of data that must fit in the cache. I/O IRAMs, ALU, FREG, BREG stand for the
number of IRAMs, ALUs, FREGs, and BREGs, respectively, that constitute the PACT
XPP. The dataflow graph width represents the number of operations that can be executed in
parallel in the same pipeline stage. The dataflow graph height represents the length of the
pipeline. Configuration cycles amounts to the length of the pipeline and to the number of
cycles dedicated to the control. The application of each optimization may

• decrease a parameter's value (-),

• increase a parameter's value (+),

• not influence a parameter (id), or

• adapt a parameter's value to fit into the goal size (make fit).

Furthermore, some resources must be kept for control in the configuration. This means
that the optimizations should not make the needs exceed more than 70-80% of each resource.



Parameter                 Goal                        Starting Value
Vector length             IRAM size (256 words)       Loop count
Reused data set size      Approx. cache size          Algorithm analysis/loop sizes
I/O IRAMs                 PACT size (16)              Algorithm inputs + outputs
ALU                       PACT size (< 64)            ALU opcode estimate
BREG                      PACT size (< 80)            BREG opcode estimate
FREG                      PACT size (< 80)            FREG opcode estimate
Data flow graph width     High                        Algorithm data flow graph
Data flow graph height    Small                       Algorithm data flow graph
Configuration cycles      < command line parameter    Algorithm analysis



Additional notations used in the following descriptions are as
follows. n is the total number of processing elements available, r is the width of the
dataflow graph, in is the maximum number of input values in a cycle, and out is the
maximum number of output values possible in a cycle. On the PACT XPP, n is the number
of ALUs, FREGs and BREGs available for a configuration, r is the number of ALUs, FREGs
and BREGs that can be started in parallel in the same pipeline stage, and in and out amount
to the number of available IRAMs. As IRAMs have 1 input port and 1 output port, the
number of IRAMs yields directly the number of input and output data.






The number of operations of a loop body may be computed by adding all logic and
arithmetic operations occurring in the instructions. The number of input values is the number
of operands of the instructions regardless of address operations. The number of output values
is the number of output operands of the instructions regardless of address operations. To
determine the number of parallel operations, input and output values, the dataflow graph
must be considered. The effects of each transformation on the architectural parameters are
now presented in detail.

Loop Interchange 

Loop interchange may be applied when the innermost loop has a too narrow iteration range. In
that case, loop interchange may allow for an innermost loop with a more
profitable iteration range. It can also be influenced by the layout of the data in memory. It
can be profitable to data locality to interchange two loops to get a more practical way to
access arrays in the cache and therefore prevent cache misses. It is of course also influenced
by data dependencies as explained above.



Parameter                 Effect
Vector length             +
Reused data set size      make fit
I/O IRAMs                 id
ALU                       id
BREG                      id
FREG                      id
Data flow graph width     id
Data flow graph height    id
Configuration cycles
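
As a minimal sketch of the interchange described above, the following C functions compute the same sum with the two loop orders. The array shape (ROWS, COLS) and the function names are assumptions chosen for illustration; the loop body has no loop-carried dependence other than the sum reduction, so the interchange is legal here.

```c
#define ROWS 64
#define COLS 4

/* Before: the innermost loop runs over only COLS = 4 iterations. */
long sum_narrow_inner(int a[ROWS][COLS]) {
    long s = 0;
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            s += a[i][j];
    return s;
}

/* After interchange: the innermost loop runs over ROWS = 64 iterations,
 * which gives a longer vector length for the innermost loop. */
long sum_wide_inner(int a[ROWS][COLS]) {
    long s = 0;
    for (int j = 0; j < COLS; j++)
        for (int i = 0; i < ROWS; i++)
            s += a[i][j];
    return s;
}
```

Note that with C's row-major layout the interchanged version walks the array with a stride of COLS elements, which illustrates why the memory layout is one of the factors weighed against the iteration range.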




Loop Distribution 

Loop distribution may be applied if a loop body is too big to fit on the PACT XPP. A
main effect of loop distribution is to reduce the processing elements needed by the
configuration. Reducing the need for IRAMs can only be a side effect.


Parameter                 Effect
Vector length             id
Reused data set size      id
I/O IRAMs                 make fit
ALU                       make fit
BREG                      make fit
FREG                      make fit
Data flow graph width
Data flow graph height
Configuration cycles
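
Loop distribution can be illustrated with the following C sketch, in which one loop body with two independent statements is split into two loops with smaller bodies. The array length N, the function names and the arithmetic are hypothetical; the split is valid here because neither statement depends on the other.

```c
#define N 8

/* Before: one loop body computing two independent results. */
void single_loop(const int *x, int *y, int *z) {
    for (int i = 0; i < N; i++) {
        y[i] = 3 * x[i] + 1;
        z[i] = x[i] * x[i] - 2;
    }
}

/* After distribution: two loops, each with a smaller body that needs
 * fewer processing elements per configuration. */
void distributed_loops(const int *x, int *y, int *z) {
    for (int i = 0; i < N; i++)
        y[i] = 3 * x[i] + 1;
    for (int i = 0; i < N; i++)
        z[i] = x[i] * x[i] - 2;
}
```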



Loop Collapsing 

Loop collapsing can be used to make the loop body use more memory resources. As several
dimensions are merged, the iteration range is increased and the memory needed is increased
as well.



Parameter                 Effect
Vector length
Reused data set size
I/O IRAMs                 +
ALU                       id
BREG                      id
FREG                      id
Data flow graph width     +
Data flow graph height    +
Configuration cycles      +



Loop Tiling 

Loop tiling, as multi-dimensional strip-mining, is influenced by all parameters. It may
be especially useful when the iteration space is by far too big to fit in the IRAM, or to
guarantee maximum execution time when the iteration space is unbounded.
See the discussion below under the heading "Limiting the Execution Time of a
Configuration." It can then make the loop body fit with respect to the resources of the PACT
XPP, namely the IRAM and cache line sizes. The size of the tiles for strip-mining and loop
tiling can be computed as:

tile size = resources available for the loop body / resources necessary for the loop body.

The resources available for the loop body are the whole resources of the PACT XPP for this
configuration. A tile size can be computed for the data and another one for the processing
elements. The final tile size is then the minimum between these two. For instance, when
the amount of data accessed is larger than the capacity of the cache, loop tiling may be
applied according to the following example code for loop tiling for the PACT
XPP.

Original loop:

    for (i=0; i<=1048576; i++)
        <loop body>

Tiled loop:

    for (i=0; i<=1048576; i+=CACHE_SIZE)
        for (j=0; j<CACHE_SIZE; j+=IRAM_SIZE)
            for (k=0; k<IRAM_SIZE; k++)
                <tiled loop body>



Figure 54: Example of loop tiling for the PACTXPP 
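
The tile-size rule above can be made concrete with a small runnable sketch. It assumes, for illustration only, that "resources necessary" is counted per loop iteration, and it uses the 256-word IRAM size from the parameter table; the function names are hypothetical.

```c
#include <assert.h>

/* IRAM capacity in words, as listed in the parameter table. */
#define IRAM_SIZE 256

/* tile size = resources available / resources necessary (integer division).
 * Assumption: needed_per_iteration is the demand of one loop iteration. */
int tile_size(int available, int needed_per_iteration) {
    assert(needed_per_iteration > 0);
    return available / needed_per_iteration;
}

/* Final tile size: the minimum of the data tile and the
 * processing-element tile. */
int final_tile_size(int data_tile, int pe_tile) {
    return data_tile < pe_tile ? data_tile : pe_tile;
}

/* Sums a[0..n-1] tile by tile; the inner loop never runs for more than
 * `tile` iterations, so its working set is bounded. */
long tiled_sum(const int *a, int n, int tile) {
    assert(tile > 0);
    long s = 0;
    for (int i = 0; i < n; i += tile) {
        int end = (i + tile < n) ? i + tile : n;  /* last tile may be partial */
        for (int k = i; k < end; k++)
            s += a[k];
    }
    return s;
}
```

Unlike the schematic listing, this sketch clamps the last tile, so the iteration count need not be a multiple of the tile size.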



Parameter                 Effect
Vector length             make fit
Reused data set size      make fit
I/O IRAMs                 id
ALU                       id
BREG                      id
FREG                      id
Data flow graph width     +
Data flow graph height    +
Configuration cycles      +



Strip-Mining

Strip-mining may be used to make the amount of memory accesses of the innermost loop fit
with the IRAMs' capacity. The processing elements do not usually represent a problem as the
PACT XPP has 64 ALU-PAEs which should be sufficient to execute any single loop body.
Nevertheless, the number of operations can be also taken into account the same way as the
data.



Parameter                 Effect
Vector length             make fit
Reused data set size      id
I/O IRAMs                 id
ALU                       id
BREG                      id
FREG                      id
Data flow graph width     +
Data flow graph height
Configuration cycles





Loop Fusion 

Loop fusion may be applied when a loop body does not use enough resources. In this case,
several loop bodies can be merged to obtain a configuration using a larger part of the
available resources.



Parameter                 Effect
Vector length             id
Reused data set size      id
I/O IRAMs
ALU                       +
BREG                      +
FREG                      +
Data flow graph width     id
Data flow graph height
Configuration cycles      +



Scalar Replacement 

The amount of memory needed by the loop body should always fit in the IRAMs.
Due to a scalar replacement optimization, some input or output data represented
by array references that should be stored in IRAMs may be replaced by scalars that are
either stored in FREGs or kept on buses.




Parameter                 Effect
Vector length             +
Reused data set size      id
I/O IRAMs                 id
ALU                       id
BREG                      id
FREG                      id
Data flow graph width     +
Data flow graph height
Configuration cycles      id
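
Scalar replacement can be pictured with the following C sketch, in which a repeatedly accessed array element is replaced by a scalar inside the inner loop. The shapes N and M and the function names are hypothetical; on the PACT XPP the scalar would correspond to a value held in a register or on a bus rather than in an IRAM.

```c
#define N 4
#define M 5

/* Before: b[i] is read and written in every inner iteration. */
void rowsum_array(int a[N][M], int *b) {
    for (int i = 0; i < N; i++) {
        b[i] = 0;
        for (int j = 0; j < M; j++)
            b[i] += a[i][j];
    }
}

/* After scalar replacement: the scalar t stands in for b[i], and the
 * array is written only once per outer iteration. */
void rowsum_scalar(int a[N][M], int *b) {
    for (int i = 0; i < N; i++) {
        int t = 0;
        for (int j = 0; j < M; j++)
            t += a[i][j];
        b[i] = t;
    }
}
```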



Loop Unrolling 

Loop unrolling, loop collapsing, loop fusion and loop distribution may be influenced by
the number of operations of the body of the loop nest and the number of data inputs and
outputs of these operations, as they modify the size of the loop body. The number of
operations should always be smaller than n, and the number of input and output data should
always be smaller than in and out.



Parameter                 Effect
Vector length             id
Reused data set size      id
I/O IRAMs                 +
ALU                       +
BREG                      +
FREG                      +
Data flow graph width     id
Data flow graph height    +
Configuration cycles      +



Unroll-and-Jam

Unroll-and-Jam may include unrolling an outer loop and then merging the inner
loops. It must compute the unrolling degree u with respect to the number of input memory
accesses m and output memory accesses p in the inner loop. The following inequality must
hold: u * m < in and u * p < out. Moreover, the number of operations of the new inner loop
must also fit on the PACT XPP.



Parameter                 Effect
Vector length             id
Reused data set size      +
I/O IRAMs                 +
ALU                       +
BREG
FREG                      +
Data flow graph width     id
Data flow graph height    +
Configuration cycles
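
The constraint on the unrolling degree and the transformation itself can be sketched as follows. The helper max_unroll_degree, the matrix shape (R, C) and the matrix-vector example are assumptions for illustration; the inequality is taken with strict comparisons, as stated above.

```c
#include <assert.h>

/* Largest u with u*m < in and u*p < out; returns 0 if no u >= 1 fits.
 * m, p: input and output memory accesses of the inner loop.
 * in, out: available inputs and outputs per cycle. */
int max_unroll_degree(int m, int p, int in, int out) {
    assert(m > 0 && p > 0);
    int u = 0;
    while ((u + 1) * m < in && (u + 1) * p < out)
        u++;
    return u;
}

#define R 4
#define C 6

/* Original nest: y[i] = sum over j of a[i][j] * x[j]. */
void matvec(int a[R][C], const int *x, int *y) {
    for (int i = 0; i < R; i++) {
        y[i] = 0;
        for (int j = 0; j < C; j++)
            y[i] += a[i][j] * x[j];
    }
}

/* Outer loop unrolled by 2, the two inner loops jammed into one.
 * Assumes R is even. */
void matvec_unroll_jam2(int a[R][C], const int *x, int *y) {
    for (int i = 0; i < R; i += 2) {
        y[i] = 0;
        y[i + 1] = 0;
        for (int j = 0; j < C; j++) {
            y[i]     += a[i][j]     * x[j];
            y[i + 1] += a[i + 1][j] * x[j];
        }
    }
}
```

In the jammed inner loop the value x[j] is shared by both statements, which is one way the reused data set can grow while the number of accesses per iteration rises with u.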





Optimizations Towards Hardware Improvements

At this step other optimizations, specific to the PACT XPP, can be made. These
optimizations deal mostly with memory problems and dataflow considerations. This is the
case of shift register synthesis, input data duplication (similar to scalar privatization), or loop
pipelining.

Shift Register Synthesis

A shift register synthesis optimization deals with array accesses that occur during the
execution of a loop body. When several values of an array are alive for different iterations, it
can be convenient to store them in registers, rather than accessing memory each time they are
needed. As the same value must be stored in different registers depending on the number of
iterations it is alive, a value shares several registers and flows from a register to another at
each iteration. It is similar to a vector register allocated to an array access with the same
value for each element. This optimization is performed directly on the dataflow graph by
inserting nodes representing registers when a value must be stored in a register. In the PACT
XPP, it amounts to storing it in a data register. A detailed explanation can be found in
Markus Weinhardt & Wayne Luk, "Memory Access Optimization for Reconfigurable
Systems," IEEE Proceedings Computers and Digital Techniques, 48(3) (May 2001).

Shift register synthesis may be mainly suitable for small to medium amounts of iterations
where values are alive. Since the pipeline length increases with each iteration for which the
value has to be buffered, the following method is better suited for medium to large distances
between accesses in one input array.

Nevertheless, this method may work very well for image processing algorithms which
mostly alter a pixel by analyzing itself and its surrounding neighbors.


Parameter                 Effect
Vector length             +
Reused data set size      id
I/O IRAMs                 id
ALU                       id
BREG                      id
FREG                      id
Data flow graph width     +
Data flow graph height
Configuration cycles      id
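
The effect of shift register synthesis can be mimicked in C with a three-tap window over a one-dimensional array. The function names and the weights 1, 2, 1 are hypothetical; the local variables r0, r1 and r2 play the role of the data registers through which a value flows from one iteration to the next.

```c
/* Direct form: three memory reads per iteration. */
void window3_direct(const int *a, int n, int *out) {
    for (int i = 0; i + 2 < n; i++)
        out[i] = a[i] + 2 * a[i + 1] + a[i + 2];
}

/* Shift-register form: one new memory read per iteration; each value
 * stays alive for three iterations and moves r2 -> r1 -> r0. */
void window3_shift(const int *a, int n, int *out) {
    if (n < 3)
        return;
    int r0 = a[0], r1 = a[1];      /* prologue fills the registers */
    for (int i = 0; i + 2 < n; i++) {
        int r2 = a[i + 2];         /* the only memory access */
        out[i] = r0 + 2 * r1 + r2;
        r0 = r1;                   /* shift by one position */
        r1 = r2;
    }
}
```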



Input Data Duplication 

An input data duplication optimization is orthogonal to shift register synthesis. If
different elements of the same array are needed concurrently, instead of storing the values in
registers, the same values may be copied in different IRAMs. The advantage against shift
register synthesis is the shorter pipeline length, and therefore the increased parallelism, and
the unrestricted applicability. On the other hand, the cache-IRAM bottleneck can affect
the performance of this solution, depending on the amounts of data to be moved.
Nevertheless, it is assumed that cache-IRAM transfers are negligible compared to transfers
in the rest of the memory hierarchy.



Parameter                 Effect
Vector length             +
Reused data set size      id
I/O IRAMs                 id
ALU                       id
BREG                      id
FREG                      id
Data flow graph width     +
Data flow graph height
Configuration cycles      id

Loop Pipelining

A loop pipelining optimization may include synchronizing
operations by inserting delays in the dataflow graph. These delays may be registers. For
the PACT XPP, it amounts to storing values in data registers to delay the operation using
them. This is the same as pipeline balancing performed by xmap.



Parameter                 Effect
Vector length             +
Reused data set size      id
I/O IRAMs                 id
ALU                       id
BREG                      id
FREG                      id
Data flow graph width     +
Data flow graph height
Configuration cycles      +



Tree Balancing 

A tree balancing optimization may include balancing the tree representing the
loop body. It may reduce the depth of the pipeline, thus reducing the execution time
of an iteration, and may increase parallelism.



Parameter                 Effect
Vector length             +
Reused data set size      id
I/O IRAMs                 id
ALU                       id
BREG                      id
FREG                      id
Data flow graph width     +
Data flow graph height
Configuration cycles





Limiting the Execution Time of a Configuration

The execution time of a configuration must be controlled. This is ensured in the compiler by
strip-mining and loop tiling that take care that not more input data than the IRAM's
capacity come in the PACT XPP in a cycle. This way the iteration range of the innermost
loop that is executed on the PACT XPP is limited, and therefore its execution time.
Moreover, partitioning ensures that loops, whose execution count can be computed at run
time, are going to be executed on the PACT XPP. This condition is trivial for for-loops, but
for while-loops, where the execution count cannot be determined statically, a
transformation exemplified by the code below can be applied. As a result, the
inner for-loop can be handled by the PACT XPP.

Original loop:

    while (ok) {
        <loop body>
    }

Transformed loop:

    while (ok)
        for (i=0; i<100 && ok; i++) {
            <loop body>
        }

Figure 55: Transformation of while loops 
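
A runnable sketch of this while-loop transformation is given here. The loop body (the hypothetical function step, which counts a value down to zero) and the function names are assumptions; CHUNK matches the bound of 100 used in the schematic code.

```c
/* Hypothetical loop body: decrements *x and reports whether to continue. */
static int step(int *x) {
    (*x)--;
    return *x > 0;
}

/* Original form: the trip count is not known before the loop starts. */
int run_while(int x) {
    int count = 0;
    int ok = x > 0;
    while (ok) {
        ok = step(&x);
        count++;
    }
    return count;
}

/* Transformed form: the inner for-loop runs at most CHUNK iterations,
 * so each entry into it has a bounded execution time. */
#define CHUNK 100
int run_chunked(int x) {
    int count = 0;
    int ok = x > 0;
    while (ok) {
        for (int i = 0; i < CHUNK && ok; i++) {
            ok = step(&x);
            count++;
        }
    }
    return count;
}
```

Both forms execute the body the same number of times; only the inner loop of the second form has a statically known upper bound.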



Case Studies

3x3 Edge Detector 

Original Code 

The following is source code: 
#include <stdio.h>

#define VERLEN 16
#define HORLEN 16
int main() {
    int v, h, inp;
    int p1[VERLEN][HORLEN];
    int p2[VERLEN][HORLEN];
    int htmp, vtmp, sum;

    for (v=0; v<VERLEN; v++)                // loop nest 1
        for (h=0; h<HORLEN; h++) {
            scanf("%d", &p1[v][h]);         // read input pixels to p1
            p2[v][h] = 0;                   // initialize p2
        }

    for (v=0; v<=VERLEN-3; v++) {           // loop nest 2
        for (h=0; h<=HORLEN-3; h++) {
            htmp = (p1[v+2][h] - p1[v][h]) +
                   (p1[v+2][h+2] - p1[v][h+2]) +
                   2 * (p1[v+2][h+1] - p1[v][h+1]);
            if (htmp < 0)
                htmp = -htmp;

            vtmp = (p1[v][h+2] - p1[v][h]) +
                   (p1[v+2][h+2] - p1[v+2][h]) +
                   2 * (p1[v+1][h+2] - p1[v+1][h]);
            if (vtmp < 0)
                vtmp = -vtmp;

            sum = htmp + vtmp;
            if (sum > 255)
                sum = 255;
            p2[v+1][h+1] = sum;
        }
    }

    for (v=0; v<VERLEN; v++)                // loop nest 3
        for (h=0; h<HORLEN; h++)
            printf("%d\n", p2[v][h]);       // print output pixels from p2
}

Preliminary Transformations

Interprocedural Optimizations 
The first step normally invokes interprocedural transformations like function inlining and loop
pushing. Since no procedure calls are within the loop body, these transformations are not 
applied to this example. 

Partitioning 

The partitioning algorithm chooses which code runs on the RISC processor and which code
runs on the XPP. Since only inner loops are considered to run on the XPP, the
basic blocks are annotated with the loop nest depth. Thus, basic blocks which are not in a
loop are separated out. Furthermore, function calls within a loop body prevent a loop from
being considered for running on the XPP.






In our benchmark, the loop nests 1 and 3 are marked as to run on the RISC host because of
the function call. In the following sections they are not considered any further.

It is to say that at this compilation stage it is not predictable if the remaining loop nests can be
synthesized for the XPP. Just the ones which definitely cannot run on it
were separated. Others may follow, since running the code on the RISC CPU is
always the reassurance in this strategy.

Loop Analysis and Normalization

The code above already has normalized loops. Nevertheless, it is more likely that human
written code would be approximately as follows:

for (v=1; v < VERLEN - 1; v++) {
    for (h=1; h < HORLEN - 1; h++) {
        htmp = (p1[v+1][h-1] - p1[v-1][h-1]) +
               (p1[v+1][h+1] - p1[v-1][h+1]) +
               2 * (p1[v+1][h] - p1[v-1][h]);
        if (htmp < 0)
            htmp = -htmp;

        vtmp = (p1[v-1][h+1] - p1[v-1][h-1]) +
               (p1[v+1][h+1] - p1[v+1][h-1]) +
               2 * (p1[v][h+1] - p1[v][h-1]);
        if (vtmp < 0)
            vtmp = -vtmp;

        sum = htmp + vtmp;
        if (sum > 255)
            sum = 255;
        p2[v][h] = sum;
    }
}

Although seen at first sight by a human reader, it is not obvious for the compiler that the loop
is well formed. Therefore, normalizing of the loop is attempted.



If the original loop induction variable is called i with the increment value s and lower and
upper loop bounds l and u, respectively, then the normalized loop with the induction variable
i_n and the upper bound u_n (the lower bound l_n is 0 by definition) is transformed as
follows:

• The upper bound calculates to u_n = (u - l) / s.

• All occurrences of i are replaced by l + i_n * s.

Applied to the code above, the loop statement for (v=1; v < VERLEN - 1; v++) with the
lower bound vl = 1, the upper bound vu = 14 (< 15 means <= 14 in integer arithmetic) and
the increment vs = 1 transforms to

    for (vn=0; vn <= (vu-vl)/vs; vn++)

or simplified

    for (vn=0; vn<=13; vn++)

The 'h-loop' is transformed equally, issuing the original code.
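
The two normalization rules can be checked with a short C sketch. The function names are hypothetical, and the summing loops only serve to show that the original and the normalized loop visit the same index values; a positive step s and inclusive bounds are assumed.

```c
#include <assert.h>

/* Upper bound of the normalized loop: u_n = (u - l) / s; its lower bound is 0.
 * Assumes a positive step and an inclusive upper bound u >= l. */
int normalized_upper(int l, int u, int s) {
    assert(s > 0 && u >= l);
    return (u - l) / s;
}

/* Original index recovered from the normalized one: i = l + i_n * s. */
int original_index(int l, int s, int i_n) {
    return l + i_n * s;
}

/* Sum of the index values visited by the original loop. */
long sum_original(int l, int u, int s) {
    long t = 0;
    for (int i = l; i <= u; i += s)
        t += i;
    return t;
}

/* Same sum computed by the normalized loop running from 0 to u_n. */
long sum_normalized(int l, int u, int s) {
    long t = 0;
    int u_n = normalized_upper(l, u, s);
    for (int i_n = 0; i_n <= u_n; i_n++)
        t += original_index(l, s, i_n);
    return t;
}
```

For the v-loop of the example, normalized_upper(1, 14, 1) yields 13, in line with the simplified loop header above.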



Idiom Recognition

In the second step, idiom recognition finds the abs() and min() structures in the loop body.
Note that although the XPP has no abs opcode, it can easily be synthesized and
should therefore be produced to simplify the internal representation. (Otherwise,
if-conversion has to handle this case which increases the complexity.)

Therefore, the code after idiom recognition is approximately as follows (abs() and
min() are compiler known functions which are directly mapped to XPP opcodes or predefined
NML modules):

for (v=0; v<=16-3; v++) {
    for (h=0; h<=16-3; h++) {
        htmp = (p1[v+2][h] - p1[v][h]) +
               (p1[v+2][h+2] - p1[v][h+2]) +
               2 * (p1[v+2][h+1] - p1[v][h+1]);
        htmp = abs(htmp);
        vtmp = (p1[v][h+2] - p1[v][h]) +
               (p1[v+2][h+2] - p1[v+2][h]) +
               2 * (p1[v+1][h+2] - p1[v+1][h]);
        vtmp = abs(vtmp);
        sum = min(htmp + vtmp, 255);
        p2[v+1][h+1] = sum;
    }
}
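
The equivalence that idiom recognition relies on can be shown with a small C sketch comparing the control-flow form with the abs/min form. The function names and the helper min_int are hypothetical (standard C provides abs() but no integer min()).

```c
#include <stdlib.h>  /* abs */

/* Helper standing in for the compiler-known min() of the text. */
static int min_int(int a, int b) {
    return a < b ? a : b;
}

/* Control-flow form, as written in the source code of the example. */
int clamp_sum_if(int htmp, int vtmp) {
    if (htmp < 0)
        htmp = -htmp;
    if (vtmp < 0)
        vtmp = -vtmp;
    int sum = htmp + vtmp;
    if (sum > 255)
        sum = 255;
    return sum;
}

/* Idiom form: the conditionals are recognized as abs() and min(). */
int clamp_sum_idiom(int htmp, int vtmp) {
    return min_int(abs(htmp) + abs(vtmp), 255);
}
```

With the conditionals expressed as function calls, the loop body becomes a single expression tree without control flow.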



Dependency Analysis 

for (v=0; v<=16-3; v++) {
    for (h=0; h<=16-3; h++) {
S1      htmp = (p1[v+2][h] - p1[v][h]) +
               (p1[v+2][h+2] - p1[v][h+2]) +
               2 * (p1[v+2][h+1] - p1[v][h+1]);
S2      htmp = abs(htmp);
S3      vtmp = (p1[v][h+2] - p1[v][h]) +
               (p1[v+2][h+2] - p1[v+2][h]) +
               2 * (p1[v+1][h+2] - p1[v+1][h]);
S4      vtmp = abs(vtmp);
S5      sum = min(htmp + vtmp, 255);
        p2[v+1][h+1] = sum;
    }
}

Figure 56: The expression tree of the edge 3x3 inner loop body

There are no loop carried dependencies which prevent pipeline vectorization. The loop 
independent scalar dependencies do not prevent pipeline vectorization since the 
transformation does not disturb the order of reads and writes. Furthermore, forward
expression substitution / dead code elimination will remove the scalars completely. 

Pre Code Generation Transformations

Forward Expression Substitution / Dead Code Elimination 

The lack of uses of htmp, vtmp and sum after the loop nest allows forward expression 
substitution along with dead code elimination to place the whole calculation into one 
statement. 

p2[v+1][h+1] = min(abs((p1[v+2][h] - p1[v][h]) +
                       (p1[v+2][h+2] - p1[v][h+2]) +
                       2 * (p1[v+2][h+1] - p1[v][h+1])) +
                   abs((p1[v][h+2] - p1[v][h]) +
                       (p1[v+2][h+2] - p1[v+2][h]) +
                       2 * (p1[v+1][h+2] - p1[v+1][h])), 255);



The scalar accesses then disappear completely. 



Mapping to IRAMs

The array accesses are mapped to IRAMs. At this stage the IRAM numbers are chosen
arbitrarily. The actual mapping to XPP IRAMs is done later.

Therefore, p1[v+x][h+y] and p2[v+x][h+y] are renamed to iramN[y] (e.g.,
p1[v+2][h] to iram2[0]). Accordingly, the code is

iram3[1] = min(abs((iram2[0] - iram0[0]) +
                   (iram2[2] - iram0[2]) +
                   2 * (iram2[1] - iram0[1])) +
               abs((iram0[2] - iram0[0]) +
                   (iram2[2] - iram2[0]) +
                   2 * (iram1[2] - iram1[0])), 255);

Tree Balancing

Fig. 27 shows an expression tree of an edge 3x3 inner loop body. The visualized expression
tree of Fig. 27 shows another valuable optimization before matching the tree.
Since the depth of the tree determines the length of the synthesized pipeline, another
simplification can decrease this depth. In both of the main sub trees, the operands of the
commutative add expressions can be interchanged to reduce the overall tree depth.
A resulting expression tree is shown in Fig. 28. In Fig. 28, one of the sub trees is shown
before and after balancing. The numbers represent the annotated maximum tree depth
from the node to its deepest child leaf node.
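The depth annotation and the benefit of balancing can be illustrated with a minimal sketch. The Node type and the four-operand sum are assumptions made for illustration and are not the trees of Figs. 27 and 28. A left-leaning chain of adds over four leaves has depth 3, while the balanced form of the same commutative sum has depth 2.

```c
#include <assert.h>
#include <stddef.h>

/* Binary expression node; a leaf has both children set to NULL. */
typedef struct Node {
    const struct Node *left;
    const struct Node *right;
} Node;

/* Maximum depth from a node to its deepest child leaf (a leaf has depth 0). */
int max_depth(const Node *n)
{
    assert(n != NULL);
    if (n->left == NULL && n->right == NULL)
        return 0;
    int l = n->left  ? max_depth(n->left)  : 0;
    int r = n->right ? max_depth(n->right) : 0;
    return 1 + (l > r ? l : r);
}

/* ((a + b) + c) + d : the add chain before balancing. */
int depth_unbalanced(void)
{
    Node a = {NULL, NULL}, b = {NULL, NULL}, c = {NULL, NULL}, d = {NULL, NULL};
    Node ab   = {&a, &b};
    Node abc  = {&ab, &c};
    Node abcd = {&abc, &d};
    return max_depth(&abcd);
}

/* (a + b) + (c + d) : the same sum after interchanging operands. */
int depth_balanced(void)
{
    Node a = {NULL, NULL}, b = {NULL, NULL}, c = {NULL, NULL}, d = {NULL, NULL};
    Node ab   = {&a, &b};
    Node cd   = {&c, &d};
    Node root = {&ab, &cd};
    return max_depth(&root);
}
```

Since each tree level corresponds to one pipeline stage in the synthesized configuration, the balanced form would shorten the pipeline by one stage in this small case.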



XPP Code Generation

Pipeline Synthesis 

As already stated, the pipeline is synthesized by a dynamic programming tree matcher. In
contrast to sequential processors, it does not generate instructions and register references, but
PAE opcodes and port connections. Fig. 29 shows the main calculation network
of the edge 3x3 configuration. The MULT-SORT combination does the abs()
calculation, while the SORT does the min() calculation. The input data preparation network
is not shown in Fig. 29. Fig. 30 shows the case of synthesized shift registers,
while the variant with duplicated input data simply
includes an IRAM for each input channel in Fig. 29. With respect to Fig. 30,
there is one input after the shift register synthesis. The leftmost input contains p1[][h], the
middle one contains p1[][h+1], and the rightmost one contains p1[][h+2].

Although this is straightforward, there remains the question of how to access the different
offsets of the vector register accesses. Although the RAM-PAEs are dual ported, it is
obvious that it is not possible to read different addresses concurrently.

Since it is not efficient to synthesize a configuration which generates the different addresses
sequentially and demultiplexes the read operands into different branches of the data flow,
other arrangements have to be made.
The two possibilities to access input data discussed above
under the heading "Optimizations Towards Hardware Improvements" yield the following in
RISC pseudo code and XPP utilization. The pseudo code running on the RISC core is
approximately:

XPPPreload(config)
for (v=0; v<=16-3; v++) {
    XPPPreload(0, &p1[v], 16)
    XPPPreload(1, &p1[v+1], 16)
    XPPPreload(2, &p1[v+2], 16)
    XPPPreloadClean(3, &p2[v+1], 16)
    XPPExecute(config, IRAM(0), IRAM(1), IRAM(2), IRAM(3))
}

for shift register synthesis and approximately:

XPPPreload(config)
for (v=0; v<=16-3; v++) {
    XPPPreload(0, &p1[v], 16)
    XPPPreload(1, &p1[v], 16)
    XPPPreload(2, &p1[v], 16)
    XPPPreload(3, &p1[v+1], 16)
    XPPPreload(4, &p1[v+1], 16)
    XPPPreload(5, &p1[v+2], 16)
    XPPPreload(6, &p1[v+2], 16)
    XPPPreload(7, &p1[v+2], 16)
    XPPPreloadClean(3, &p2[v+1], 16)
    XPPExecute(config, IRAM(0), IRAM(1), IRAM(2), IRAM(3),
               IRAM(4), IRAM(5), IRAM(6), IRAM(7))
}

for data duplication, respectively.

The values for place & route and simulation are compared in the following table. Note that a
common RISC DSP with two MAC units and hardware loop support needs about 4000 cycles
for the same code. This comparison does not account for cache misses. Furthermore, it is
obvious that the number of input values is very small in this example and the DSP
calculation time is proportional to that number. The XPP performance, on the other hand, will
improve with the number of input values. Therefore, the XPP performance will be more
impressive with bigger image sizes.



Parameter                         Value (shift register synthesis)        Value (data duplication)
Vector length                     16                                      16
Reused data set size              256                                     256
I/O IRAMs                         3 I + 1 O = 4                           8 I + 1 O = 9
ALU                               27                                      21
BREG                              21 (1 defined + 20 route)               10 (1 defined + 9 route)
FREG                              22 (9 defined + 23 route)               19 (3 defined + 16 route)
Data flow graph width             14                                      14
Data flow graph height            3 (shift registers) + 8 (calculation)   8 (calculation)
Configuration cycles (simulated)
  configuration                   2262                                    2145
  preloads                        14*3*4 = 168 (*)                        8*8*4 = 256
  cycles                          14*57 = 798                             14*52 = 728
  sum                             3228                                    3129

(*) assuming 4 words/cycle burst transfer



Enhancing Parallelism

After the synthesis, the configuration calculating the inner loop utilizes 27 ALUs and 4
IRAMs for shift register synthesis and 21 ALUs and 9 IRAMs for data duplication,
respectively. Assuming an XPP64 core, this leaves plenty of room for further optimizations.
Nevertheless, since all optimizations enhancing parallelism are performed before the
synthesis takes place, it is crucial that they estimate the needed resources and the benefit of
the transformation very carefully. Furthermore, they have to account for both input
preparation strategies to estimate correct values.

Loop Unrolling 

Fully unrolling the inner loop would not lead to satisfying results, because the number of
inputs and outputs increases dramatically. That means data duplication would not be
applicable and shift register synthesis would exhaust most of the benefits of the parallelism
by producing a very long pipeline for each data flow graph. Although partial unrolling of the
inner loop would be applicable, it promises not much benefit for the area penalty
introduced.

Loop unrolling the outer loop is also not applicable since it produces a further configuration.
Nevertheless, a related transformation could do a good job on this loop nest.



Unroll-and-Jam

The unroll-and-jam algorithm enhances parallelism and also improves IRAM usage. It brings
pairs of iterations together, ideally reusing IRAM outputs and calculation results. The
algorithm partially unrolls the outer loop and fuses the originated inner loops. Before the
unroll-and-jam is performed, the so-called unroll-and-jam factor must be determined, which
denominates the unrolling factor of the outer loop. This is mainly influenced by the number
of ALUs n (= 64 assuming XPP64) and calculates to

$c_{\text{unroll-and-jam}} = \frac{n}{n_{\text{inner}}} = \frac{64}{27} = 2$ (integer division),

where $n_{\text{inner}} = 27$ is the number of ALUs used by the inner loop body under shift register synthesis.



Thus the source code would be transformed to:
for (v=0; v<=VERLEN-3; v+=2) {
    for (h=0; h<=HORLEN-3; h++) {
        p2[v+1][h+1] = min(abs((p1[v+2][h] - p1[v][h]) +
                               (p1[v+2][h+2] - p1[v][h+2]) +
                               2 * (p1[v+2][h+1] - p1[v][h+1])) +
                           abs((p1[v][h+2] - p1[v][h]) +
                               (p1[v+2][h+2] - p1[v+2][h]) +
                               2 * (p1[v+1][h+2] - p1[v+1][h])), 255);
        p2[v+2][h+1] = min(abs((p1[v+3][h] - p1[v+1][h]) +
                               (p1[v+3][h+2] - p1[v+1][h+2]) +
                               2 * (p1[v+3][h+1] - p1[v+1][h+1])) +
                           abs((p1[v+1][h+2] - p1[v+1][h]) +
                               (p1[v+3][h+2] - p1[v+3][h]) +
                               2 * (p1[v+2][h+2] - p1[v+2][h])), 255);
    }
}



The transformation introduces additional accesses to p1[v+3][h], p1[v+3][h+2],
p1[v+3][h+1], and p1[v+1][h+1] (the former hole in the access pattern) as well as a write
access to p2[v+2][h+1]. This means 2 IRAMs more for shift register synthesis (one
input, one output) and 5 IRAMs more for data duplication (4 input, 1 output), while
performance is doubled.
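The factor computation and the transformation itself can be sketched as follows. The sketch assumes a 16x16 picture, so that the outer loop count of 14 is a multiple of the factor 2 and no peeling is needed. The helper names and the test pattern are illustrative only.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define VERLEN 16
#define HORLEN 16

/* Unroll-and-jam factor by integer division, e.g. 64 / 27 = 2. */
int unroll_and_jam_factor(int n_alus, int alus_per_body)
{
    assert(alus_per_body > 0);
    return n_alus / alus_per_body;
}

/* One output pixel of the edge 3x3 kernel: the value of p2[v+1][h+1]. */
static int edge_pixel(int p1[VERLEN][HORLEN], int v, int h)
{
    int s = abs((p1[v+2][h]   - p1[v][h]) +
                (p1[v+2][h+2] - p1[v][h+2]) +
                2 * (p1[v+2][h+1] - p1[v][h+1])) +
            abs((p1[v][h+2]   - p1[v][h]) +
                (p1[v+2][h+2] - p1[v+2][h]) +
                2 * (p1[v+1][h+2] - p1[v+1][h]));
    return s < 255 ? s : 255;
}

/* Original loop nest. */
void edge3x3_plain(int p1[VERLEN][HORLEN], int p2[VERLEN][HORLEN])
{
    for (int v = 0; v <= VERLEN - 3; v++)
        for (int h = 0; h <= HORLEN - 3; h++)
            p2[v+1][h+1] = edge_pixel(p1, v, h);
}

/* Outer loop unrolled by 2, the two inner loops fused into one. */
void edge3x3_unroll_and_jam(int p1[VERLEN][HORLEN], int p2[VERLEN][HORLEN])
{
    assert((VERLEN - 2) % 2 == 0);   /* even outer loop count, no peeling */
    for (int v = 0; v <= VERLEN - 3; v += 2)
        for (int h = 0; h <= HORLEN - 3; h++) {
            p2[v+1][h+1] = edge_pixel(p1, v, h);
            p2[v+2][h+1] = edge_pixel(p1, v + 1, h);
        }
}

/* Returns 1 when both loop nests produce the same picture. */
int unroll_and_jam_agrees(void)
{
    static int p1[VERLEN][HORLEN], a[VERLEN][HORLEN], b[VERLEN][HORLEN];
    for (int v = 0; v < VERLEN; v++)
        for (int h = 0; h < HORLEN; h++)
            p1[v][h] = (v * 53 + h * 17) % 256;   /* arbitrary test pattern */
    memset(a, 0, sizeof a);
    memset(b, 0, sizeof b);
    edge3x3_plain(p1, a);
    edge3x3_unroll_and_jam(p1, b);
    return memcmp(a, b, sizeof a) == 0;
}
```

In the fused body, both statements read rows v+1 and v+2 of p1, which is where the reuse of IRAM outputs described above comes from.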



Parameter                         Value (shift register synthesis)        Value (data duplication - no IRAM placement)
Vector length                     16                                      16
Reused data set size              256                                     256
I/O IRAMs                         4 I + 2 O = 6                           12 I + 2 O = 14
ALU                               45                                      37
BREG                              31 (12 defined + 19 route)              42 (4 defined + 38 route)
FREG                              29 (1 defined + 28 route)               18 (1 defined + 17 route)
Data flow graph width             14                                      14
Data flow graph height            3 (shift registers) + 8 (calculation)   8 (calculation)
Configuration cycles (simulated)
  configuration                   2753                                    2754
  preloads                        7*4*4 = 112                             7*12*4 = 336
  cycles                          7*53 = 371                              7*69 = 483
  sum                             3236                                    3573
Parameter                         Value (data duplication - with IRAM placement)
Vector length                     16
Reused data set size              256
I/O IRAMs                         12 I + 2 O = 14
ALU                               37
BREG                              36 (4 defined + 32 route)
FREG                              24 (1 defined + 23 route)
Data flow graph width             14
Data flow graph height            3 (shift registers) + 8 (calculation)
Configuration cycles (simulated)
  configuration                   2768
  preloads                        7*12*4 = 336
  cycles                          7*51 = 357
  sum                             3461







The simulated results are shown in the table above. Note the differences of the
two columns labeled with "data duplication." The first used xmap to place the IRAMs,
while in the second, the IRAMs were placed by hand using a greedy algorithm which places
IRAMs that are operands of the same operator in one line (as long as this is possible). The
second solution improved the iteration cycles by 18. This shows that IRAM placement has a
great impact on the final performance.

The traditional unroll-and-jam algorithm uses loop peeling to split the outer loop into a preloop
and an unroll-able main loop to handle odd loop counts. When, for instance, n = 128 is
assumed, the unroll-and-jam factor would calculate to

$c_{\text{unroll-and-jam}} = \frac{128}{27} = 4$.

Since the outer loop count (14) is not a multiple of 4, the algorithm virtually peels off the first
two iterations and fuses the two loops at the end, adding guards to the inner loop body. Then
the code looks approximately as follows (guards emphasized):

for (v=0; v<=VERLEN-5; v+=4) {
    for (h=0; h<=HORLEN-3; h++) {
        p2[v+1][h+1] = min(abs((p1[v+2][h] - p1[v][h]) +
                               (p1[v+2][h+2] - p1[v][h+2]) +
                               2 * (p1[v+2][h+1] - p1[v][h+1])) +
                           abs((p1[v][h+2] - p1[v][h]) +
                               (p1[v+2][h+2] - p1[v+2][h]) +
                               2 * (p1[v+1][h+2] - p1[v+1][h])), 255);
        p2[v+2][h+1] = min(abs((p1[v+3][h] - p1[v+1][h]) +
                               (p1[v+3][h+2] - p1[v+1][h+2]) +
                               2 * (p1[v+3][h+1] - p1[v+1][h+1])) +
                           abs((p1[v+1][h+2] - p1[v+1][h]) +
                               (p1[v+3][h+2] - p1[v+3][h]) +
                               2 * (p1[v+2][h+2] - p1[v+2][h])), 255);
        if (v>0) p2[v+3][h+1] = min(abs((p1[v+4][h] - p1[v+2][h]) +
                               (p1[v+4][h+2] - p1[v+2][h+2]) +
                               2 * (p1[v+4][h+1] - p1[v+2][h+1])) +
                           abs((p1[v+2][h+2] - p1[v+2][h]) +
                               (p1[v+4][h+2] - p1[v+4][h]) +
                               2 * (p1[v+3][h+2] - p1[v+3][h])), 255);
        if (v>1) p2[v+4][h+1] = min(abs((p1[v+5][h] - p1[v+3][h]) +
                               (p1[v+5][h+2] - p1[v+3][h+2]) +
                               2 * (p1[v+5][h+1] - p1[v+3][h+1])) +
                           abs((p1[v+3][h+2] - p1[v+3][h]) +
                               (p1[v+5][h+2] - p1[v+5][h]) +
                               2 * (p1[v+4][h+2] - p1[v+4][h])), 255);
    }
}



Parameterized Function
Source code 

The benchmark source code is not very likely to be written in that form in real world 
applications. Normally, it would be encapsulated in a function with parameters for input and 
output arrays along with the sizes of the picture to work on. 

Therefore the source code would look similar to the following: 

void edge3x3(int *p1, int *p2, int HORLEN, int VERLEN)
{
    int v, h, htmp, vtmp, sum;

    for (v=0; v<=VERLEN-3; v++) {
        for (h=0; h<=HORLEN-3; h++) {
            htmp = (*(p1 + (v+2) * HORLEN + h) - *(p1 + v * HORLEN + h)) +
                   (*(p1 + (v+2) * HORLEN + h+2) - *(p1 + v * HORLEN + h+2)) +
                   2 * (*(p1 + (v+2) * HORLEN + h+1) - *(p1 + v * HORLEN + h+1));
            if (htmp < 0)
                htmp = -htmp;
            vtmp = (*(p1 + v * HORLEN + h+2) - *(p1 + v * HORLEN + h)) +
                   (*(p1 + (v+2) * HORLEN + h+2) - *(p1 + (v+2) * HORLEN + h)) +
                   2 * (*(p1 + (v+1) * HORLEN + h+2) - *(p1 + (v+1) * HORLEN + h));
            if (vtmp < 0)
                vtmp = -vtmp;
            sum = htmp + vtmp;
            if (sum > 255)
                sum = 255;
            *(p2 + (v+1) * HORLEN + h+1) = sum;
        }
    }
}

This requires some additional features from the compiler.
• interprocedural optimizations and analysis
• hints by the programmer (e.g., a compiler-known assert(VERLEN % 2 == 0) makes
unroll-and-jam actually possible without peeling off iterations and running them
conditionally).

Fitting the Algorithm Optimally to the Array

Since HORLEN and VERLEN are not known at compile time, these unknown parameters
introduce some constraints which prevent pipeline vectorization. The compiler must assume
that the IRAMs cannot hold all HORLEN input values in a row, so pipeline vectorization
would not be possible.

Strip Mining Inner Loop 

Strip mining partitions the inner loop into a loop that runs over a strip, which is chosen to be
of the same size as the IRAMs can hold, and a by-strip loop iterating over the strips.
The strip loop's upper bound must be adjusted for the possible incomplete last strip.
After the strip mining, the original code would be approximately as follows (outer v-loop
neglected):

for (h=0; h<=HORLEN-3; h+=stripsize) {
    for (hh=h; hh<=min(h+stripsize-1, HORLEN-3); hh++) {
        htmp = (*(p1 + (v+2) * HORLEN + hh) - *(p1 + v * HORLEN + hh)) +
        ...
    }
}
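The bound adjustment for the incomplete last strip can be sketched in isolation. The function below is a hypothetical helper, not part of the compiler described here. It only counts how many strips and how many inner iterations the strip-mined loop pair executes, so that the coverage of the index range 0 to HORLEN-3 can be checked.

```c
#include <assert.h>
#include <stddef.h>

static int min_int(int a, int b) { return a < b ? a : b; }

/*
 * Runs the strip-mined loop structure over h = 0 .. horlen-3.
 * Returns the number of strips and stores the total number of
 * inner-loop iterations in *visited.
 */
int strip_mine(int horlen, int stripsize, int *visited)
{
    assert(stripsize > 0 && visited != NULL);
    int strips = 0;
    *visited = 0;
    for (int h = 0; h <= horlen - 3; h += stripsize) {
        /* the min() clips the last strip when it is incomplete */
        for (int hh = h; hh <= min_int(h + stripsize - 1, horlen - 3); hh++)
            (*visited)++;
        strips++;
    }
    return strips;
}
```

For a row of 1024 values and a strip size of 256, the loop range holds 1022 indices, which the sketch splits into three full strips and one clipped strip of 254 iterations.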



Assuming an IRAM size of 256 as the strip size, the following simulated results can be
obtained for one strip. The values must be multiplied with the number of strips to be
calculated.



Parameter                         Value (shift register synthesis)        Value (data duplication - with IRAM placement)
Vector length                     16                                      16
Reused data set size              256                                     256
I/O IRAMs                         4 I + 2 O = 6                           12 I + 2 O = 14
ALU                               45                                      37
BREG                              31 (12 defined + 19 route)              42 (4 defined + 38 route)
FREG                              29 (1 defined + 28 route)               18 (1 defined + 17 route)
Data flow graph width             14                                      14
Data flow graph height            3 (shift registers) + 8 (calculation)   8 (calculation)
Configuration cycles (simulated)
  configuration                   2753                                    2754
  preloads                        7*4*64 = 1792                           7*12*64 = 5376
  cycles                          128*530 = 67840                         128*553 = 70784
  sum                             72385                                   78914



The RISC DSP needs about 1.47 million cycles for this amount of data. As mentioned above,
these values do not include cache miss penalties and truly underestimate the real values.
Furthermore, it can be seen that data duplication does not improve the performance. The
reason for this seems to be a worse placement and routing.



FIR Filter

Original Code

Source code:
#define N 256
#define M 8

for (i = 0; i < N-M+1; i++) {
S:  y[i] = 0;
    for (j = 0; j < M; j++)
S':     y[i] += c[j] * x[i+M-j-1];
}



The constants N and M are replaced by their values by the pre-processor. The data
dependency graph is shown in Fig. 31.
for (i = 0; i < 269; i++) {
S:  y[i] = 0;
    for (j = 0; j < 8; j++)
S':     y[i] += c[j] * x[i+7-j];
}

The following is a corresponding table:



Parameter                 Value
Vector length             269
Reused data set size      -
I/O IRAMs                 3
ALU                       2
BREG                      0
FREG                      0
Data flow graph width     1
Data flow graph height    2
Configuration cycles      2+8=10



First Solution

In a case in which it is desired to save memory, a straightforward solution is to
unroll the inner loop and to use shift register synthesis to delay the values of array x in the
pipeline. No other optimization is applied before, as either they do not have an effect on the
loop or they increase the need for IRAMs. After loop unrolling, the following code
is obtained:

for (i = 0; i < 269; i++) {
    y[i] = 0;
    y[i] += c[0] * x[i+7];
    y[i] += c[1] * x[i+6];
    y[i] += c[2] * x[i+5];
    y[i] += c[3] * x[i+4];
    y[i] += c[4] * x[i+3];
    y[i] += c[5] * x[i+2];
    y[i] += c[6] * x[i+1];
    y[i] += c[7] * x[i];
}



The following is a corresponding table:

Parameter                 Value
Vector length             269
Reused data set size      -
I/O IRAMs                 9
ALU                       16
BREG                      0
FREG                      0
Data flow graph width     2
Data flow graph height    9
Configuration cycles      9+269=278



Dataflow analysis reveals that y[0] = f(x[0], ..., x[7]), y[1] = f(x[1], ..., x[8]), ...,
y[i] = f(x[i], ..., x[i+7]). Successive values of y depend on almost the same successive
values of x. To prevent unnecessary accesses to the IRAMs, the values of x needed for the
computation of the next values of y are kept in registers. In this case, this shift register
synthesis needs 7 registers. This will be achieved on the PACT XPP by keeping them in
FREGs. Then the dataflow graph of Fig. 32 is obtained. An
IRAM is used for the input values and an IRAM for the output values. The first 8 cycles are
used to fill the pipeline and then the throughput is of one output value/cycle.
The code may be represented as follows:

r0 = x[0];
r1 = x[1];
r2 = x[2];
r3 = x[3];
r4 = x[4];
r5 = x[5];
r6 = x[6];
r7 = x[7];
for (i = 0; i < 269; i++) {
    y[i] = c7*r0 + c6*r1 + c5*r2 + c4*r3 + c3*r4 + c2*r5 + c1*r6 + c0*r7;
    r0 = r1;
    r1 = r2;
    r2 = r3;
    r3 = r4;
    r4 = r5;
    r5 = r6;
    r6 = r7;
    r7 = x[i+7];
}
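The register scheme can be compared against the direct form with a short sketch. It assumes N = 256 and M = 8 as in the #define lines of the original code, so there are N-M+1 = 249 outputs. It also assumes that the register refilled at the end of iteration i takes the element x[i+8] needed by the next iteration, guarded against reading past the array. The function names and test data are illustrative.

```c
#include <assert.h>
#include <string.h>

#define N 256
#define M 8
#define NOUT (N - M + 1)

/* Direct form: every output re-reads M input values. */
void fir_direct(const int c[M], const int x[N], int y[NOUT])
{
    for (int i = 0; i < NOUT; i++) {
        y[i] = 0;
        for (int j = 0; j < M; j++)
            y[i] += c[j] * x[i + M - j - 1];
    }
}

/* Shift register form: r[k] holds x[i+k]; one new input per output. */
void fir_shift_register(const int c[M], const int x[N], int y[NOUT])
{
    int r[M];
    for (int k = 0; k < M; k++)
        r[k] = x[k];                      /* fill the pipeline */
    for (int i = 0; i < NOUT; i++) {
        int acc = 0;
        for (int k = 0; k < M; k++)
            acc += c[M - 1 - k] * r[k];   /* c7*r0 + c6*r1 + ... + c0*r7 */
        y[i] = acc;
        for (int k = 0; k < M - 1; k++)
            r[k] = r[k + 1];              /* shift by one position */
        if (i + M < N)
            r[M - 1] = x[i + M];          /* next input value, if any */
    }
}

/* Returns 1 when both forms compute the same outputs. */
int fir_variants_agree(void)
{
    static int c[M], x[N], a[NOUT], b[NOUT];
    for (int j = 0; j < M; j++)
        c[j] = j + 1;                     /* arbitrary coefficients */
    for (int k = 0; k < N; k++)
        x[k] = (k * 7) % 13 - 6;          /* arbitrary input samples */
    fir_direct(c, x, a);
    fir_shift_register(c, x, b);
    return memcmp(a, b, sizeof a) == 0;
}
```

Only one element of x is read per output in the second form, which mirrors the single input IRAM in the resulting parameter table.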



A final table is shown below, and the expected speedup with respect to a standard
superscalar processor with 2 instructions issued per cycle is 13.6.

Parameter                 Value
Vector length             269
Reused data set size      -
I/O IRAMs                 2
ALU                       16
BREG                      0
FREG                      7
Data flow graph width     3
Data flow graph height    9
Configuration cycles      8+269=277

Ops                                   Number
LD/ST (2 cycles)                      2
ADDRCOMP (1 cycle)                    0
ADD/SUB (1 cycle)                     8
MUL (2 cycles)                        8
SHIFT (1 cycle)                       0
Cycles per iteration                  28
Cycles needed for the loop (2-way)    (28*269)/2=3766
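The arithmetic behind the table can be reproduced with a small sketch. The struct and function names are assumptions for illustration. The per-operation cycle weights are the ones given in the left column of the table, and the speedup is taken as the ratio of the 2-way superscalar loop cycles to the XPP configuration cycles.

```c
#include <assert.h>

/* Operation counts of one loop iteration on the reference processor. */
typedef struct {
    int ldst;      /* LD/ST, 2 cycles each   */
    int addrcomp;  /* ADDRCOMP, 1 cycle each */
    int addsub;    /* ADD/SUB, 1 cycle each  */
    int mul;       /* MUL, 2 cycles each     */
    int shift;     /* SHIFT, 1 cycle each    */
} OpCounts;

int cycles_per_iteration(OpCounts ops)
{
    return 2 * ops.ldst + ops.addrcomp + ops.addsub + 2 * ops.mul + ops.shift;
}

/* Loop cycles on a processor issuing 2 instructions per cycle. */
int loop_cycles_2way(OpCounts ops, int iterations)
{
    assert(iterations >= 0);
    return cycles_per_iteration(ops) * iterations / 2;
}

/* Ratio of reference cycles to XPP configuration cycles. */
double expected_speedup(int reference_cycles, int xpp_cycles)
{
    assert(xpp_cycles > 0);
    return (double)reference_cycles / (double)xpp_cycles;
}
```

With the counts of the table (2 LD/ST, 8 ADD/SUB, 8 MUL), one iteration costs 28 cycles, the loop costs 3766 cycles, and dividing by the 277 configuration cycles gives roughly the speedup of 13.6 quoted above.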



Variant with Larger Loop Bounds

Taking larger loop bounds and setting the values of N and M to 1024 and 64:

for (i = 0; i < 961; i++) {
    y[i] = 0;
    for (j = 0; j < 64; j++)
        y[i] += c[j] * x[i+63-j];
}

Following the loop optimizations driver given before, loop tiling is applied to
reduce the iteration range of the inner loop. The following loop nest is
obtained:

for (i = 0; i < 961; i++) {
    y[i] = 0;
    for (jj = 0; jj < 8; jj++)
        for (j = 0; j < 8; j++)
            y[i] += c[8*jj+j] * x[i+63-8*jj-j];
}
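That tiling leaves the result unchanged can be checked with a brief sketch, assuming N = 1024, M = 64 and a tile size of 8 as in the listing. The function names and the test data are illustrative only.

```c
#include <assert.h>
#include <string.h>

#define N 1024
#define M 64
#define NOUT (N - M + 1)   /* 961 outputs */
#define TILE 8

/* Untiled FIR loop nest. */
void fir_plain(const int c[M], const int x[N], int y[NOUT])
{
    for (int i = 0; i < NOUT; i++) {
        y[i] = 0;
        for (int j = 0; j < M; j++)
            y[i] += c[j] * x[i + M - 1 - j];
    }
}

/* Inner loop split into M/TILE tiles of TILE iterations each. */
void fir_tiled(const int c[M], const int x[N], int y[NOUT])
{
    assert(M % TILE == 0);
    for (int i = 0; i < NOUT; i++) {
        y[i] = 0;
        for (int jj = 0; jj < M / TILE; jj++)
            for (int j = 0; j < TILE; j++)
                y[i] += c[TILE * jj + j] * x[i + M - 1 - TILE * jj - j];
    }
}

/* Returns 1 when the tiled and untiled nests agree. */
int tiling_preserves_result(void)
{
    static int c[M], x[N], a[NOUT], b[NOUT];
    for (int j = 0; j < M; j++)
        c[j] = j % 5 - 2;            /* arbitrary coefficients */
    for (int k = 0; k < N; k++)
        x[k] = (k * 7) % 13 - 6;     /* arbitrary input samples */
    fir_plain(c, x, a);
    fir_tiled(c, x, b);
    return memcmp(a, b, sizeof a) == 0;
}
```

The tiled nest visits the same (i, j) pairs in the same order, with j rewritten as 8*jj+j, so the sums are identical.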
A subsequent application of loop unrolling on the inner loop yields:
for (i = 0; i < 961; i++) {
    y[i] = 0;
    for (jj = 0; jj < 8; jj++) {
        y[i] += c[8*jj]   * x[i+63-8*jj];
        y[i] += c[8*jj+1] * x[i+62-8*jj];
        y[i] += c[8*jj+2] * x[i+61-8*jj];
        y[i] += c[8*jj+3] * x[i+60-8*jj];
        y[i] += c[8*jj+4] * x[i+59-8*jj];
        y[i] += c[8*jj+5] * x[i+58-8*jj];
        y[i] += c[8*jj+6] * x[i+57-8*jj];
        y[i] += c[8*jj+7] * x[i+56-8*jj];
    }
}



Finally, the same dataflow graph as above is obtained, except that the coefficients
must be read from another IRAM rather than being directly handled like constants by the
multiplications. After shift register synthesis, the code may be the following:
for (i = 0; i < 961; i++) {
    r0 = x[i+56];
    r1 = x[i+57];
    r2 = x[i+58];
    r3 = x[i+59];
    r4 = x[i+60];
    r5 = x[i+61];
    r6 = x[i+62];
    r7 = x[i+63];
    for (jj = 0; jj < 8; jj++) {
        y[i] = c[8*jj]*r0 + c[8*jj+1]*r1 + c[8*jj+2]*r2 +
               c[8*jj+3]*r3 + c[8*jj+4]*r4 + c[8*jj+5]*r5 +
               c[8*jj+6]*r6 + c[8*jj+7]*r7;
        r0 = r1;
        r1 = r2;
        r2 = r3;
        r3 = r4;
        r4 = r5;
        r5 = r6;
        r6 = r7;
        r7 = x[i+63-8*jj];
    }
}

The following table is the same as above except for the vector length, and the
expected speedup with respect to a standard superscalar processor with 2 instructions issued
per cycle is 17.5.



Parameter                 Value
Vector length             8
Reused data set size      -
I/O IRAMs                 2
ALU                       16
BREG                      0
FREG                      7
Data flow graph width     3
Data flow graph height    9
Configuration cycles      8+8=16

Ops                                   Number
LD/ST (2 cycles)                      10
ADDRCOMP (1 cycle)                    0
ADD/SUB (1 cycle)                     16
MUL (2 cycles)                        17
SHIFT (1 cycle)                       0
Cycles per iteration                  70
Cycles needed for the loop (2-way)    (70*8)/2=280



More Parallel Solution

The solution presented above does not expose a lot of parallelism in the loop.
To explicitly parallelize the loop before generating the dataflow graph
can be tried. Exposing more parallelism may mean more pressure on
the memory hierarchy.

In the data dependence graph presented above, the only loop-carried
dependence is the dependence on S', and it is only caused by the reference to y[i]. Hence,
node splitting is applied to get a more suitable data dependence graph.
Accordingly, the following may be obtained:
for (i = 0; i < 249; i++) {
    y[i] = 0;
    for (j = 0; j < 8; j++)
    {
        tmp = c[j] * x[i+7-j];
        y[i] += tmp;
    }
}



Then scalar expansion may be performed on tmp to remove the anti loop-carried
dependence caused by it, and the following code may be obtained:
for (i = 0; i < 249; i++) {
    y[i] = 0;
    for (j = 0; j < 8; j++)
    {
        tmp[j] = c[j] * x[i+7-j];
        y[i] += tmp[j];
    }
}



The parameter table is the following:

Parameter                 Value
Vector length             249
Reused data set size      -
I/O IRAMs                 3
ALU                       2
BREG                      0
FREG                      1
Data flow graph width     2
Data flow graph height    2
Configuration cycles      2+8=10
Loop distribution may then be applied to get a vectorizable and a not
vectorizable loop.

for (i = 0; i < 249; i++) {
    y[i] = 0;
    for (j = 0; j < 8; j++)
        tmp[j] = c[j] * x[i+7-j];
    for (j = 0; j < 8; j++)
        y[i] += tmp[j];
}

10 

The parameter table given below corresponds to the two inner loops, so that it can be
compared with the preceding table.

Parameter                 Value
Vector length             249
Reused data set size      -
I/O IRAMs                 5
ALU                       2
BREG                      0
FREG                      1
Data flow graph width     1
Data flow graph height    3
Configuration cycles      1*8+1*8=16



The architecture may be taken into account. The first
loop is fully parallel, which means that 2*8=16 input values would be needed at a time.
This is all right, as it corresponds to the number of IRAMs of the PACT XPP. Hence,
to strip-mine the first inner loop is not required. To strip-mine the
second loop is also not required. The second
loop is a reduction. It computes the sum of a vector. This may be easily found by the
reduction recognition optimization, and the following code may be obtained:
for (i = 0; i < 249; i++) {
    y[i] = 0;

    for (j = 0; j < 8; j++)
        tmp[j] = c[j] * x[i+7-j];

    /* load the partial sums from memory using a shorter
       vector length */
    for (j = 0; j < 4; j++)
        aux[j] = tmp[2*j] + tmp[2*j+1];

    /* accumulate the short vector */
    for (j = 0; j < 1; j++)
        aux[2*j] = aux[2*j] + aux[2*j+1];

    /* sequence of scalar instructions to add up the partial
       sums */
    y[i] = aux[0] + aux[2];
}
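The reduction can be illustrated as a pairwise tree over the eight products. The sketch below is an assumption about the intended structure and not a transcription of the listing: its second stage performs two pair additions (aux[0] += aux[1] and aux[2] += aux[3]), as in the fully unrolled code given later in this section, so that all eight inputs contribute to the final sum. The function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Sums eight values in three tree levels: 8 -> 4 -> 2 -> 1. */
int tree_reduce8(const int tmp[8])
{
    assert(tmp != NULL);
    int aux[4];

    /* level 1: four independent pair sums */
    for (int j = 0; j < 4; j++)
        aux[j] = tmp[2 * j] + tmp[2 * j + 1];

    /* level 2: two independent pair sums, kept in aux[0] and aux[2] */
    for (int j = 0; j < 2; j++)
        aux[2 * j] = aux[2 * j] + aux[2 * j + 1];

    /* level 3: final scalar addition */
    return aux[0] + aux[2];
}
```

The additions within one level do not depend on each other, which is what allows each level to be mapped to ALUs working in parallel.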

Like above, only one table is given below for all innermost loops and the last
instruction computing y[i].

Parameter                 Value
Vector length             249
Reused data set size      -
I/O IRAMs                 12
ALU                       4
BREG                      0
FREG                      0
Data flow graph width     1
Data flow graph height    4
Configuration cycles      1*8+1*4+1*1=13



Finally, loop unrolling may be applied on the inner loops. The number of operations is
always less than the number of processing elements of the PACT XPP.
for (i = 0; i < 961; i++)
{
    tmp[0] = c[0] * x[i+7];
    tmp[1] = c[1] * x[i+6];
    tmp[2] = c[2] * x[i+5];
    tmp[3] = c[3] * x[i+4];
    tmp[4] = c[4] * x[i+3];
    tmp[5] = c[5] * x[i+2];
    tmp[6] = c[6] * x[i+1];
    tmp[7] = c[7] * x[i];

    aux[0] = tmp[0] + tmp[1];
    aux[1] = tmp[2] + tmp[3];
    aux[2] = tmp[4] + tmp[5];
    aux[3] = tmp[6] + tmp[7];

    aux[0] = aux[0] + aux[1];
    aux[2] = aux[2] + aux[3];

    y[i] = aux[0] + aux[2];
}

The dataflow graph illustrated in Fig. 33, representing the inner
loop, may be obtained.

It could be mapped on the PACT XPP with each layer executed in parallel, thus
requiring 4 cycles/iteration and 15 ALU-PAEs, 8 of which are needed in parallel.
As the graph is already synchronized, the throughput reaches one iteration/cycle, after 4
cycles to fill the pipeline. The coefficients are taken as constant inputs by the ALUs
performing the multiplications.

A drawback of this solution may be that it uses 16 IRAMs, and that the input data must
be stored in a special order. The number of needed IRAMs can be reduced if the coefficients
are handled like constants for each ALU. But due to data locality of the program, it can
be assumed that the data already reside in the cache. As the transfer of data
from the cache to the IRAMs can be achieved efficiently, the configuration can be executed
on the PACT XPP without waiting for the data to be ready in the IRAMs. Accordingly,
the parameter table may be the following:



Parameter                 Value
Vector length             249
Reused data set size      -
I/O IRAMs                 16
ALU                       15
BREG                      0
FREG                      0
Data flow graph width     8
Data flow graph height    4
Configuration cycles      4+961



Variant with Larger Bounds

To make things a bit more interesting, in one case, the values of N and M were set
to 1024 and 64.

for (i = 0; i < 961; i++) {
    y[i] = 0;
    for (j = 0; j < 64; j++)
        y[i] += c[j] * x[i+63-j];
}

The data dependence graph is the same as above. Node splitting may then
be applied to get a more convenient data dependence graph.
for (i = 0; i < 961; i++) {
    y[i] = 0;
    for (j = 0; j < 64; j++)
    {
        tmp = c[j] * x[i+63-j];
        y[i] += tmp;
    }
}

After scalar expansion:

for (i = 0; i < 961; i++) {
    y[i] = 0;
    for (j = 0; j < 64; j++)
    {
        tmp[j] = c[j] * x[i+63-j];
        y[i] += tmp[j];
    }
}

After loop distribution:

for (i = 0; i < 961; i++) {
    y[i] = 0;
    for (j = 0; j < 64; j++)
        tmp[j] = c[j] * x[i+63-j];
    for (j = 0; j < 64; j++)
        y[i] += tmp[j];
}

After going through the compiling process, the set of optimizations
that depends upon architectural parameters may be arrived at. It might be desired
to split the iteration space, as too many operations would have to be performed in parallel if
it is kept as such. Hence, strip-mining may be performed on the 2 loops.
Only 16 data can be accessed at a time, so, because of the first loop, the
factor will be 64 * 2/16 = 8 for the 2 loops (as it is
desired to execute both at the same time on the PACT XPP).

for (i = 0; i < 961; i++) {
    y[i] = 0;
    for (jj = 0; jj < 8; jj++)
        for (j = 0; j < 8; j++)
            tmp[8*jj+j] = c[8*jj+j] * x[i+63-8*jj-j];
    for (jj = 0; jj < 8; jj++)
        for (j = 0; j < 8; j++)
            y[i] += tmp[8*jj+j];
}

Then, loop fusion on the jj loops may be performed.
for (i = 0; i < 961; i++) {
    y[i] = 0;
    for (jj = 0; jj < 8; jj++) {
        for (j = 0; j < 8; j++)
            tmp[8*jj+j] = c[8*jj+j] * x[i+63-8*jj-j];
        for (j = 0; j < 8; j++)
            y[i] += tmp[8*jj+j];
    }
}

Reduction recognition may then be applied on the second innermost loop.
for (i = 0; i < 961; i++) {
    tmp = 0;
    for (jj = 0; jj < 8; jj++)
    {
        for (j = 0; j < 8; j++)
            tmp[8*jj+j] = c[8*jj+j] * x[i+63-8*jj-j];

        /* load the partial sums from memory using a shorter vector
           length */
        for (j = 0; j < 4; j++)
            aux[j] = tmp[8*jj+2*j] + tmp[8*jj+2*j+1];

        /* accumulate the short vector */
        for (j = 0; j < 1; j++)
            aux[2*j] = aux[2*j] + aux[2*j+1];

        /* sequence of scalar instructions to add up the partial
           sums */
        y[i] = aux[0] + aux[2];
    }
}

Loop unrolling may then be performed:
for (i = 0; i < 961; i++)
    for (jj = 0; jj < 8; jj++)
    {
        tmp[8*jj]   = c[8*jj]   * x[i+63-8*jj];
        tmp[8*jj+1] = c[8*jj+1] * x[i+62-8*jj];
        tmp[8*jj+2] = c[8*jj+2] * x[i+61-8*jj];
        tmp[8*jj+3] = c[8*jj+3] * x[i+60-8*jj];
        tmp[8*jj+4] = c[8*jj+4] * x[i+59-8*jj];
        tmp[8*jj+5] = c[8*jj+5] * x[i+58-8*jj];
        tmp[8*jj+6] = c[8*jj+6] * x[i+57-8*jj];
        tmp[8*jj+7] = c[8*jj+7] * x[i+56-8*jj];

        aux[0] = tmp[8*jj]   + tmp[8*jj+1];
        aux[1] = tmp[8*jj+2] + tmp[8*jj+3];
        aux[2] = tmp[8*jj+4] + tmp[8*jj+5];
        aux[3] = tmp[8*jj+6] + tmp[8*jj+7];

        aux[0] = aux[0] + aux[1];
        aux[2] = aux[2] + aux[3];

        y[i] = aux[0] + aux[2];
    }

The innermost loop may be implemented on the PACT XPP directly with a
counter. The IRAMs may be used in FIFO mode, and filled according to the addresses of
the arrays in the loop. IRAM0, IRAM2, IRAM4, IRAM6 and IRAM8 contain array c.
IRAM1, IRAM3, IRAM5 and IRAM7 contain array x. Array c contains 64 elements,
i.e., each IRAM contains 8 elements. Array x contains 1024 elements, i.e., 128
elements for each IRAM. Array y is directly written to memory, as it is a global array and
its address is constant. This constant is used to initialize the address counter of the
configuration. A final parameter table is the following:



Parameter                 Value
Vector length             8
Reused data set size      -
I/O IRAMs                 16
ALU                       15
BREG                      0
FREG                      0
Data flow graph width     8
Data flow graph height    4
Configuration cycles      4+8=12



Nevertheless, it should be noted that this version should be less efficient than the previous
one. As the same data must be loaded in the different IRAMs from the cache, there
are a lot of transfers to be achieved before the configuration can begin the
computations. This overhead must be taken into account by the compiler when choosing the
code generation strategy. This means also that the first solution is the solution that will be
chosen by the compiler.



Other Variant

Source Code

for (i = 0; i < N-M+1; i++) {
    tmp = 0;
    for (j = 0; j < M; j++)
        tmp += c[j] * x[i+M-j-1];
    x[i] = tmp;
}

In this case, it is trivial that the data dependence graph is cyclic due to dependences on tmp.
Therefore, scalar expansion is applied on the loop, and, in fact, the same code as
the first version of the FIR filter is obtained, as shown below.

for (i = 0; i < N-M+1; i++) {
    tmp[i] = 0;
    for (j = 0; j < M; j++)
        tmp[i] += c[j] * x[i+M-j-1];
    x[i] = tmp[i];
}

Matrix Multiplication

Original Code

Source code:
#include <stdio.h>

#define L 10
#define M 15
#define N 20
int A[L][M];
int B[M][N];
int R[L][N];

main() {
    int i, j, k, tmp, aux;

    /* input A (L*M values) */
    for (i=0; i<L; i++)
        for (j=0; j<M; j++)
            scanf("%d", &A[i][j]);

    /* input B (M*N values) */
    for (i=0; i<M; i++)
        for (j=0; j<N; j++)
            scanf("%d", &B[i][j]);

    /* multiply */
    for (i=0; i<L; i++)
        for (j=0; j<N; j++) {
            aux = 0;
            for (k=0; k<M; k++)
                aux += A[i][k] * B[k][j];
            R[i][j] = aux;
        }

    /* write data stream */
    for (i=0; i<L; i++)
        for (j=0; j<N; j++)
            printf("%d\n", R[i][j]);
}

Preliminary Transformations

Since no inline-able function calls are present, no interprocedural code movement is done.

Of the four loop nests, the one with the "/* multiply */" comment is the only candidate for 
running partly on the XPP. All others have function calls in the loop body and are therefore 
discarded as candidates very early in the compiler. 

Dependency Analysis 

for (i=0; i<L; i++)
    for (j=0; j<N; j++) {
S1:     aux = 0;
        for (k=0; k<M; k++)
S2:         aux += A[i][k] * B[k][j];
S3:     R[i][j] = aux;
    }

Fig. 34 shows a data dependency graph for matrix multiplication. The data
dependency graph shows no dependencies that prevent pipeline vectorization. The loop
carried true dependence from S2 to itself can be handled by a feedback of aux as described in
Markus Weinhardt et al., "Memory Access Optimization for Reconfigurable Systems,"
supra.

Reverse Loop-Invariant Code Motion 

To get a perfect loop nest, S1 and S3 may be moved inside the k-loop. Therefore,
appropriate guards may be generated to protect the assignments. The code after this
transformation is as follows:

for (i=0; i<L; i++)
    for (j=0; j<N; j++)
        for (k=0; k<M; k++) {
            if (k == 0) aux[j] = 0;
            aux[j] += A[i][k] * B[k][j];
            if (k == M-1) R[i][j] = aux[j];
        }
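
As a hedged illustration of the guarded nest, the following C sketch places the original imperfect nest next to the perfect nest so that the two results can be compared. The function names and the use of an enum for the dimensions are assumptions made for this sketch only; they do not appear in the original listing.

```c
#include <assert.h>

enum { L = 10, M = 15, N = 20 };   /* dimensions taken from the source listing */

/* Original imperfect nest: S1 and S3 sit outside the k-loop. */
void matmul_original(int A[L][M], int B[M][N], int R[L][N])
{
    for (int i = 0; i < L; i++)
        for (int j = 0; j < N; j++) {
            int aux = 0;                       /* S1 */
            for (int k = 0; k < M; k++)
                aux += A[i][k] * B[k][j];      /* S2 */
            R[i][j] = aux;                     /* S3 */
        }
}

/* Perfect nest: S1 and S3 are moved into the k-loop behind guards. */
void matmul_guarded(int A[L][M], int B[M][N], int R[L][N])
{
    int aux[N];
    assert(M > 0);   /* the guards fire only if the k-loop runs at least once */
    for (int i = 0; i < L; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < M; k++) {
                if (k == 0) aux[j] = 0;               /* guarded S1 */
                aux[j] += A[i][k] * B[k][j];          /* S2 */
                if (k == M - 1) R[i][j] = aux[j];     /* guarded S3 */
            }
}
```

Calling both functions on the same inputs should fill R identically, since the guards select exactly the first and the last k iteration.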

Scalar Expansion 

A goal may be to interchange the loop nests to improve the array accesses to utilize the
cache best. However, the guarded statements involving "aux" may cause
backward loop carried anti-dependencies carried by the j loop. Scalar expansion
may break these dependencies, allowing loop interchange.

for (i=0; i<L; i++)
    for (j=0; j<N; j++)
        for (k=0; k<M; k++) {
            if (k == 0) aux[j] = 0;
            aux[j] += A[i][k] * B[k][j];
            if (k == M-1) R[i][j] = aux[j];
        }

Loop Interchange for Cache Reuse

Visualizing the main loop shows the iteration spaces for the array accesses. Fig.
35 is a visualization of array access sequences. Since C arrays are placed in row major order,
the cache lines are placed in the array rows. At first sight, there seems to be no need for
optimization because the algorithm requires at least one array access to stride over a
column. Nevertheless, this assumption misses the fact that the access rate is of interest, too.
Closer examination shows that array R is accessed in every j iteration, while B is accessed
every k iteration, always producing a cache miss. ("aux" is not currently discussed since it
is not expected that it would be written to or read from memory, as there are no defs or uses
outside the loop nest.) This leaves a possibility for loop interchange to improve cache access
as proposed by Kennedy and Allen in Markus Weinhardt et al., "Pipeline Vectorization,"
supra.

To find the best loop nest, the algorithm
may interchange each loop of the nests into the innermost position and
annotate it with the so-called innermost memory cost term. This cost term is a
constant for known loop bounds or a function of the loop bound for unknown loop bounds.
The term may be calculated in three steps.

• First, the cost of each reference in the innermost loop body may be calculated to:

    • 1, if the reference does not depend on the loop induction variable of the (current)
      innermost loop;

    • the loop count, if the reference depends on the loop induction variable and strides
      over a non-contiguous area with respect to the cache layout;

    • N·s/b, if the reference depends on the loop induction variable and strides over a
      contiguous dimension. In this case, N is the loop count, s is the step size and b is
      the cache line size, respectively.

In this case, a "reference" is an access to an array. Since the transformation attempts to
optimize cache access, it must address references to the same array within small distances
as one. This may prohibit over-estimation of the actual costs.

• Second, each reference cost may be weighted with a factor for each other loop,
  which is:

    • 1, if the reference does not depend on the loop index;

    • the loop count, if the reference depends on the loop index.

• Third, the overall loop nest cost may be calculated by summing the costs of all
  reference costs.

After invoking this algorithm for each loop as the innermost, the one with the lowest cost
may be chosen as the innermost, the next as the next outermost, and so on.
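
The three steps above can be sketched in C for the three references R[i][j], A[i][k] and B[k][j] of the matrix multiplication. This is a minimal sketch under stated assumptions: the names `ref_cost`, `nest_cost` and `loop_id` are hypothetical, a unit step size (s = 1) is assumed, and the cache line size b is given in array elements.

```c
#include <assert.h>

/* Which of the three loops is placed innermost. */
enum loop_id { LOOP_I, LOOP_J, LOOP_K };

/* Step 1: cost of one reference with respect to the innermost loop.
   depends:    the reference uses the innermost induction variable
   contiguous: that variable indexes the contiguous (last) dimension
   count:      trip count of the innermost loop
   b:          cache line size in elements (step size s = 1 assumed) */
static double ref_cost(int depends, int contiguous, double count, double b)
{
    if (!depends) return 1.0;
    return contiguous ? count / b : count;
}

/* Steps 2 and 3 for the references R[i][j], A[i][k] and B[k][j]. */
double nest_cost(enum loop_id inner, double L, double M, double N, double b)
{
    assert(b >= 1.0);
    const double trip[3] = { L, N, M };     /* indexed by LOOP_I, LOOP_J, LOOP_K */
    /* uses[r][l] is 1 if reference r depends on loop l (columns: i, j, k) */
    const int uses[3][3] = { {1, 1, 0},     /* R[i][j] */
                             {1, 0, 1},     /* A[i][k] */
                             {0, 1, 1} };   /* B[k][j] */
    /* loop that indexes the last (contiguous) dimension of each reference */
    const enum loop_id last[3] = { LOOP_J, LOOP_K, LOOP_J };
    double total = 0.0;
    for (int r = 0; r < 3; r++) {
        double c = ref_cost(uses[r][inner], last[r] == inner, trip[inner], b);
        for (int l = 0; l < 3; l++)
            if (l != (int)inner && uses[r][l])
                c *= trip[l];               /* step 2: weight by each other loop used */
        total += c;                         /* step 3: sum over all references */
    }
    return total;
}
```

With L = 10, M = 15, N = 20 and an assumed line size of b = 4 elements, the sketch yields 537.5 for k innermost, 650 for i innermost and 275 for j innermost, so the j-loop would be selected as the innermost loop.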



Innermost loop   R[i][j]     A[i][k]     B[k][j]     Memory access cost
k                1·L·N       (M/b)·L     M·N         L·N + (M/b)·L + M·N
i                1·L·N       1·L·M       1·M·N       L·N + L·M + M·N
j                (N/b)·L     1·L·M       (N/b)·M     (N/b)·(L + M) + L·M

The preceding table shows the values for the matrix multiplication. Since the j term is the
smallest (assuming b > 1), the j-loop is chosen to be the innermost. The next outer
loop then is k, and the outermost is i. Thus, the resulting code after loop interchange may
be:

for (i=0; i<L; i++)
    for (k=0; k<M; k++)
        for (j=0; j<N; j++) {
            if (k == 0) aux[j] = 0;
            aux[j] += A[i][k] * B[k][j];
            if (k == M-1) R[i][j] = aux[j];
        }

Fig. 36 shows the improved iteration spaces. It shows array access
sequences after optimization. The improvement is visible to the naked eye, since
array B is now read following the cache lines. This optimization does not optimize
primarily for the XPP, but
mainly optimizes the cache-hit rate, thus improving the overall performance.
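
The effect of the interchange on the result can be checked with a small C sketch. The function names are assumptions made for illustration; the sketch only shows that the i-k-j order with the expanded array aux produces the same R as the i-j-k order, while B is now traversed row by row in the innermost loop.

```c
#include <assert.h>

enum { L = 10, M = 15, N = 20 };   /* dimensions taken from the source listing */

/* Reference: i-j-k order with a scalar accumulator. */
void matmul_ijk(int A[L][M], int B[M][N], int R[L][N])
{
    for (int i = 0; i < L; i++)
        for (int j = 0; j < N; j++) {
            int aux = 0;
            for (int k = 0; k < M; k++)
                aux += A[i][k] * B[k][j];   /* B is read down a column */
            R[i][j] = aux;
        }
}

/* Interchanged: i-k-j order. aux holds one partial sum per column j so that
   all partial sums stay live while the k-loop advances. */
void matmul_ikj(int A[L][M], int B[M][N], int R[L][N])
{
    int aux[N];
    assert(M > 0);
    for (int i = 0; i < L; i++)
        for (int k = 0; k < M; k++)
            for (int j = 0; j < N; j++) {
                if (k == 0) aux[j] = 0;
                aux[j] += A[i][k] * B[k][j];   /* B is read along a row */
                if (k == M - 1) R[i][j] = aux[j];
            }
}
```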

Unroll and Jam 

After improving the cache access behavior, the possibility for reduction recognition has been
destroyed. This is a typical example for transformations where one excludes the other.
Nevertheless, more parallelism may be obtained by doing unroll-and-jam.
Therefore, the outer loop may be partially unrolled with the unroll factor. This
factor is mainly chosen by the minimum of two calculations:

■ # available IRAMs / # used IRAMs in the inner loop body

■ # available ALU resources / # used ALU resources in the inner loop.
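
The minimum of the two ratios can be sketched as a small C helper. The function name, the parameter names and the idea of passing the IRAMs reserved for loop-independent data as a separate argument are assumptions made for this sketch.

```c
#include <assert.h>

/* Unroll factor = min(IRAM-limited factor, ALU-limited factor).
   IRAMs that hold data independent of the unrolled loop are reserved
   before the division and do not count towards the per-body usage. */
int unroll_factor(int total_irams, int reserved_irams, int irams_per_body,
                  int total_alus, int alus_per_body)
{
    assert(irams_per_body > 0 && alus_per_body > 0);
    assert(total_irams >= reserved_irams);
    int by_iram = (total_irams - reserved_irams) / irams_per_body;
    int by_alu  = total_alus / alus_per_body;
    return by_iram < by_alu ? by_iram : by_alu;
}
```

For the matrix multiplication discussed in this section, assuming 16 IRAMs of which 2 are reserved for "aux" and "R", 2 IRAMs and 2 ALUs per loop body, and 64 ALUs, `unroll_factor(16, 2, 2, 64, 2)` returns 7, i.e., the IRAM constraint dominates.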

In this example embodiment, the accesses to "A" and "B" depend on k (the loop which will
be unrolled). Therefore, they must be considered in the calculation. The accesses to
"aux" and "R" do not depend on k. Thus, they can be subtracted from the available IRAMs,
but do not need to be added to the denominator. Therefore, (assuming an
XPP64) 14/2 = 7 is calculated for the unroll factor obtained by the IRAM resources.

On the other hand, the loop body involves two ALU operations (1 add, 1 mult), which
may yield an unrolling factor of approximately 64/2 = 32.
(This is an inaccurate estimation since it neither estimates the resources spent by the
controlling network, which may decrease the unroll factor, nor takes into account that, e.g.,
the BREG-PAEs also have an adder, which may increase the unroll factor. Although it does
not influence this example, the unroll factor calculation should account for this in a
production compiler.) The constraint generated by the IRAMs therefore dominates by far.

Having chosen the unroll factor, the loop trip count is trimmed to be a
multiple of that factor. Since the k loop has a loop count of 15, the first iteration
may be peeled off and the remaining loop may be unrolled.
for (i=0; i<L; i++) {
    for (k=0; k<1; k++) {
        for (j=0; j<N; j++) {
            if (k==0) aux[j] = 0;
            aux[j] += A[i][k] * B[k][j];
            if (k==M-1) R[i][j] = aux[j];
        }
    }
    for (k=1; k<M; k+=7) {
        for (j=0; j<N; j++) {
            if (k==0) aux[j] = 0;
            aux[j] += A[i][k] * B[k][j];
            if (k==M-1) R[i][j] = aux[j];
        }
        for (j=0; j<N; j++) {
            if (k+1==0) aux[j] = 0;
            aux[j] += A[i][k+1] * B[k+1][j];
            if (k+1==M-1) R[i][j] = aux[j];
        }
        for (j=0; j<N; j++) {
            if (k+2==0) aux[j] = 0;
            aux[j] += A[i][k+2] * B[k+2][j];
            if (k+2==M-1) R[i][j] = aux[j];
        }
        for (j=0; j<N; j++) {
            if (k+3==0) aux[j] = 0;
            aux[j] += A[i][k+3] * B[k+3][j];
            if (k+3==M-1) R[i][j] = aux[j];
        }
        for (j=0; j<N; j++) {
            if (k+4==0) aux[j] = 0;
            aux[j] += A[i][k+4] * B[k+4][j];
            if (k+4==M-1) R[i][j] = aux[j];
        }
        for (j=0; j<N; j++) {
            if (k+5==0) aux[j] = 0;
            aux[j] += A[i][k+5] * B[k+5][j];
            if (k+5==M-1) R[i][j] = aux[j];
        }
        for (j=0; j<N; j++) {
            if (k+6==0) aux[j] = 0;
            aux[j] += A[i][k+6] * B[k+6][j];
            if (k+6==M-1) R[i][j] = aux[j];
        }
    }
}

Due to the fact that the reverse loop invariant code motion placed the loop
invariant code into the inner loop, which is now duplicated seven times, it is very likely that
dead code elimination can get rid of some of these duplicates. Thus, the code may be
shortened to:

for (i=0; i<L; i++) {
    for (k=0; k<1; k++) {
        for (j=0; j<N; j++) {
            if (k==0) aux[j] = 0;
            aux[j] += A[i][k] * B[k][j];
        }
    }
    for (k=1; k<M; k+=7) {
        for (j=0; j<N; j++) {
            aux[j] += A[i][k] * B[k][j];
        }
        for (j=0; j<N; j++) {
            aux[j] += A[i][k+1] * B[k+1][j];
        }
        for (j=0; j<N; j++) {
            aux[j] += A[i][k+2] * B[k+2][j];
        }
        for (j=0; j<N; j++) {
            aux[j] += A[i][k+3] * B[k+3][j];
        }
        for (j=0; j<N; j++) {
            aux[j] += A[i][k+4] * B[k+4][j];
        }
        for (j=0; j<N; j++) {
            aux[j] += A[i][k+5] * B[k+5][j];
        }
        for (j=0; j<N; j++) {
            aux[j] += A[i][k+6] * B[k+6][j];
            if (k+6==M-1) R[i][j] = aux[j];
        }
    }
}



Before jamming of the inner loops, it may be taken into account
that the first iteration of the k loop was peeled off, which would produce its own
configuration. Since the unroll-and-jam factor is calculated to fit into one
configuration, this side effect should be prevented. Because it should be no problem to
run the k loop with variable step sizes, the k loops may be fused again,
the step size may be adjusted, and the statements may be guarded. This may
yield:

for (i=0; i<L; i++) {
    for (k=0; k<M; k+= k<1 ? 1 : 7) {
        for (j=0; j<N; j++) {
            if (k==0) aux[j] = 0;
            if (k==0) aux[j] += A[i][k] * B[k][j];
        }
        for (j=0; j<N; j++) {
            if (k>0) aux[j] += A[i][k] * B[k][j];
        }
        for (j=0; j<N; j++) {
            if (k>0) aux[j] += A[i][k+1] * B[k+1][j];
        }
        for (j=0; j<N; j++) {
            if (k>0) aux[j] += A[i][k+2] * B[k+2][j];
        }
        for (j=0; j<N; j++) {
            if (k>0) aux[j] += A[i][k+3] * B[k+3][j];
        }
        for (j=0; j<N; j++) {
            if (k>0) aux[j] += A[i][k+4] * B[k+4][j];
        }
        for (j=0; j<N; j++) {
            if (k>0) aux[j] += A[i][k+5] * B[k+5][j];
        }
        for (j=0; j<N; j++) {
            if (k>0) aux[j] += A[i][k+6] * B[k+6][j];
            if (k+6==M-1) R[i][j] = aux[j];
        }
    }
}



Now the inner loops may be jammed, and the following may be
obtained.

for (i=0; i<L; i++) {
    for (k=0; k<M; k+= k<1 ? 1 : 7) {
        for (j=0; j<N; j++) {
            if (k==0) aux[j] = 0;
            if (k==0) aux[j] += A[i][k] * B[k][j];
            if (k>0) {
                aux[j] += A[i][k]   * B[k][j];
                aux[j] += A[i][k+1] * B[k+1][j];
                aux[j] += A[i][k+2] * B[k+2][j];
                aux[j] += A[i][k+3] * B[k+3][j];
                aux[j] += A[i][k+4] * B[k+4][j];
                aux[j] += A[i][k+5] * B[k+5][j];
                aux[j] += A[i][k+6] * B[k+6][j];
                if (k+6==M-1) R[i][j] = aux[j];
            }
        }
    }
}
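
To illustrate that the fused and jammed nest still computes the matrix product, the following C sketch compares it with a plain reference. It is a sketch under stated assumptions: the function names are hypothetical, and the seven jammed statements are written as a short loop over u for brevity, which is not how the configuration itself would be expressed.

```c
#include <assert.h>

enum { L = 10, M = 15, N = 20, UNROLL = 7 };

/* Plain reference multiplication. */
void matmul_reference(int A[L][M], int B[M][N], int R[L][N])
{
    for (int i = 0; i < L; i++)
        for (int j = 0; j < N; j++) {
            int aux = 0;
            for (int k = 0; k < M; k++)
                aux += A[i][k] * B[k][j];
            R[i][j] = aux;
        }
}

/* Fused k-loop with variable step size: one peeled step (k == 0),
   then steps of UNROLL iterations (k == 1, 8 for M == 15). */
void matmul_jammed(int A[L][M], int B[M][N], int R[L][N])
{
    int aux[N];
    /* one peeled iteration plus a whole number of unrolled bodies */
    assert(M > 1 && (M - 1) % UNROLL == 0);
    for (int i = 0; i < L; i++)
        for (int k = 0; k < M; k += (k < 1) ? 1 : UNROLL)
            for (int j = 0; j < N; j++) {
                if (k == 0) {
                    aux[j] = 0;
                    aux[j] += A[i][k] * B[k][j];
                }
                if (k > 0) {
                    for (int u = 0; u < UNROLL; u++)        /* the jammed bodies */
                        aux[j] += A[i][k + u] * B[k + u][j];
                    if (k + UNROLL - 1 == M - 1) R[i][j] = aux[j];
                }
            }
}
```

For M = 15 the k-loop visits k = 0, 1 and 8, and the result is written in the last visit, where k + 6 equals M - 1.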



XPP Code Generation

The innermost loop can be synthesized in a configuration which uses 14 IRAMs for the input
data, one IRAM to temporarily store aux, and one IRAM for the output array R. Furthermore,
it may be necessary to pass the value of k to the XPP to direct the dataflow. This may be
done by a streaming input. Fig. 37 shows the dataflow graph of the synthesized
configuration and shows matrix multiplication after unroll and
jam. The rightmost 3 branches are omitted from the graph and event connections are
highlighted.


The following code shows the pseudo code that may be executed on the RISC processor.

XPPPreload(config)
for (i=0; i<L; i++) {
    XPPPreload(0, &A[i][0], M)
    XPPPreload(1, &A[i][0], M)
    XPPPreload(2, &A[i][0], M)
    XPPPreload(3, &A[i][0], M)
    XPPPreload(4, &A[i][0], M)
    XPPPreload(5, &A[i][0], M)
    XPPPreload(6, &A[i][0], M)
    XPPPreloadClean(15, &R[i][0], M)
    for (k=0; k<M; k+= k<1 ? 1 : 7) {
        XPPPreload(7, &B[k][0], N)
        XPPPreload(8, &B[k+1][0], N)
        XPPPreload(9, &B[k+2][0], N)
        XPPPreload(10, &B[k+3][0], N)
        XPPPreload(11, &B[k+4][0], N)
        XPPPreload(12, &B[k+5][0], N)
        XPPPreload(13, &B[k+6][0], N)

        XPPExecute(config, IRAM(0), IRAM(1), IRAM(2),
                   IRAM(3), IRAM(4), IRAM(5),
                   IRAM(6), IRAM(7), IRAM(8),
                   IRAM(9), IRAM(10), IRAM(11),
                   IRAM(12), IRAM(13), IRAM(15), k)
    }
}

The following table shows the simulated configuration. The complete multiplication needs
about 3120 cycles without the preloading and configuration. A typical RISC-DSP core with
two MAC units and hardware loop support needs over 26000 cycles (when data is
in zero-latency internal memory). Although the time for preloads and cache misses is
neglected here, the values according to an embodiment of the present invention may
result in improvements of 200-300 percent compared to a standalone RISC core.



The following is a corresponding parameter table. 



Parameter                          Value
Vector length                      20
Reused data set size               20
I/O IRAMs                          14 I + 1 O + 1 internal
ALU                                20
BREG                               26 (8 defined + 18 route)
FREG                               28 (4 defined + 24 route)
Data flow graph width              14
Data flow graph height             6 (without routing and balancing)
Configuration cycles (simulated)   configuration   2633
                                   preloads        10*3*7*5 = 1050
                                                   10*7*15 = 1050
                                   cycles          ((k==0) 112 + (k==1) 100 + (k==7) 100) * 10 = 3120
                                   sum             7853



Viterbi Encoder

Original Code

Source Code:

/* C-language butterfly */
#define BFLY(i) {\
    unsigned char metric, m0, m1, decision; \
    metric = ((Branchtab29_1[i] ^ sym1) + \
              (Branchtab29_2[i] ^ sym2) + 1)/2; \
    m0 = vp->old_metrics[i] + metric; \
    m1 = vp->old_metrics[i+128] + (15 - metric); \
    decision = (m0-m1) >= 0; \
    vp->new_metrics[2*i] = decision ? m1 : m0; \
    vp->dp->w[i/16] |= decision << ((2*i)&31); \
    m0 -= (metric+metric-15); \
    m1 += (metric+metric-15); \
    decision = (m0-m1) >= 0; \
    vp->new_metrics[2*i+1] = decision ? m1 : m0; \
    vp->dp->w[i/16] |= decision << ((2*i+1)&31); \
}


int update_viterbi29(void *p, unsigned char sym1, unsigned char sym2) {
    int i;
    struct v29 *vp = p;
    unsigned char *tmp;
    int normalize = 0;

    for (i=0; i<8; i++)
        vp->dp->w[i] = 0;

    for (i=0; i<128; i++)
        BFLY(i);

    /* Renormalize metrics */
    if (vp->new_metrics[0] > 150) {
        int i;
        unsigned char minmetric = 255;

        for (i=0; i<64; i++)
            if (vp->new_metrics[i] < minmetric)
                minmetric = vp->new_metrics[i];
        for (i=0; i<64; i++)
            vp->new_metrics[i] -= minmetric;
        normalize = minmetric;
    }

    vp->dp++;
    tmp = vp->old_metrics;
    vp->old_metrics = vp->new_metrics;
    vp->new_metrics = tmp;

    return normalize;
}

Interprocedural Optimizations and Scalar Transformations

Since no inline-able function calls are present, in an embodiment of the present invention, no
interprocedural code movement is done.

After expression simplification, strength reduction, SSA renaming, copy coalescing and
idiom recognition, the code may be approximately as presented below (statements
are reordered for convenience). Note that idiom recognition may find the combination of
min() and use the comparison result for decision and _decision. However, the resulting
computation cannot be expressed in C, so it is described below as a comment.
int update_viterbi29(void *p, unsigned char sym1, unsigned char sym2) {
    int i;
    struct v29 *vp = p;
    unsigned char *tmp;
    int normalize = 0;

    char *_vpdpw_ = vp->dp->w;
    for (i=0; i<8; i++)
        *_vpdpw_++ = 0;

    char *_bt29_1 = Branchtab29_1;
    char *_bt29_2 = Branchtab29_2;
    char *_vpom0 = vp->old_metrics;
    char *_vpom128 = vp->old_metrics+128;
    char *_vpnm = vp->new_metrics;
    char *_vpdpw = vp->dp->w;

    for (i=0; i<128; i++) {
        unsigned char metric, _tmp, m0, m1, _m0, _m1, decision, _decision;

        metric = ((*_bt29_1++ ^ sym1) +
                  (*_bt29_2++ ^ sym2) + 1)/2;
        _tmp = (metric+metric-15);
        m0 = *_vpom0++ + metric;
        m1 = *_vpom128++ + (15 - metric);
        _m0 = m0 - _tmp;
        _m1 = m1 + _tmp;
        // decision = m0 >= m1;
        // _decision = _m0 >= _m1;
        *_vpnm++ = min(m0, m1);      // = decision ? m1 : m0
        *_vpnm++ = min(_m0, _m1);    // = _decision ? _m1 : _m0
        _vpdpw[i >> 4] |= (m0 >= m1)   /* decision */  << ((2*i) & 31)
                        | (_m0 >= _m1) /* _decision */ << ((2*i+1) & 31);
    }

    /* Renormalize metrics */
    if (vp->new_metrics[0] > 150) {
        int i;
        unsigned char minmetric = 255;

        char *_vpnm = vp->new_metrics;
        for (i=0; i<64; i++)
            minmetric = min(minmetric, *_vpnm++);

        _vpnm = vp->new_metrics;
        for (i=0; i<64; i++)
            *_vpnm++ -= minmetric;
        normalize = minmetric;
    }

    vp->dp++;
    tmp = vp->old_metrics;
    vp->old_metrics = vp->new_metrics;
    vp->new_metrics = tmp;

    return normalize;
}
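
The idiom mentioned above, i.e., replacing the select on the decision bit by a minimum while keeping the comparison result, can be illustrated with a short C sketch. The function names `survivor_select`, `survivor_min` and `min_u8` are assumptions made for this illustration and are not part of the original code.

```c
#include <assert.h>

static unsigned char min_u8(unsigned char a, unsigned char b)
{
    return a < b ? a : b;
}

/* Select form, as written in the butterfly macro. */
unsigned char survivor_select(unsigned char m0, unsigned char m1, int *decision)
{
    assert(decision);
    *decision = (m0 - m1) >= 0;   /* operands are promoted to int, so no wrap-around */
    return *decision ? m1 : m0;
}

/* Idiom form: one minimum plus one comparison. */
unsigned char survivor_min(unsigned char m0, unsigned char m1, int *decision)
{
    assert(decision);
    *decision = m0 >= m1;
    return min_u8(m0, m1);
}
```

Both functions return the same survivor metric and the same decision bit for every pair of 8-bit metrics, which is why the select can be mapped onto a minimum operation.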

Initialization

The first loop (setting vp->dp->w[0...7] to zero) may be most efficiently executed on
the RISC.

Butterfly Loop

The second loop (with the BFLY() macro expanded) is of interest for the XPP compiler and
needs further examination:

char *iram0 = Branchtab29_1;            // XPPPreload(0, Branchtab29_1, 128/4);
char *iram2 = Branchtab29_2;            // XPPPreload(2, Branchtab29_2, 128/4);
char *iram4 = vp->old_metrics;          // XPPPreload(4, vp->old_metrics, 128/4);
char *iram5 = vp->old_metrics+128;      // XPPPreload(5, vp->old_metrics+128, 128/4);
short *iram6 = vp->new_metrics;         // XPPPreload(6, vp->new_metrics, 128/2);
unsigned long *iram7 = vp->dp->w;       // XPPPreload(7, vp->dp->w, 8);
// sym1 & sym2 are in IRAM 1 & 3

for (i=0; i<128; i++) {
    unsigned char metric, _tmp, m0, m1, _m0, _m1;

    metric = ((*iram0++ ^ sym1) +
              (*iram2++ ^ sym2) + 1)/2;
    _tmp = (metric << 1) - 15;
    m0 = *iram4++ + metric;
    m1 = *iram5++ + (15 - metric);
    _m0 = m0 - _tmp;
    _m1 = m1 + _tmp;
    // assuming big endian; little endian has the shift on the latter min()
    *iram6++ = (min(m0, m1) << 8) | min(_m0, _m1);
    iram7[i >> 4] |= (m0 >= m1)   << ((2*i) & 31)
                   | (_m0 >= _m1) << ((2*i+1) & 31);
}

The corresponding data flow graph is shown in Fig. 38 (for now ignoring the fact
that the IRAM accesses are mostly char accesses). The solid lines represent data flow, while
the dashed lines represent event flow.



The following is a corresponding parameter table.

Parameter                  Value
Vector length              128
Reused data set size
I/O IRAMs                  6 I + 2 O
ALU                        25
BREG                       few
FREG                       few
Data flow graph width      4
Data flow graph height     11
Configuration cycles       11+128

Some problems are immediately noticed: IRAM7 is fully busy reading
and rewriting the same address sixteen times. Loop tiling to a tile size of sixteen gives the
redundant load store elimination a chance to read the value once and accumulate the bits
temporarily, writing the value to the IRAM at the end of this inner loop. Loop
fusion with the initialization loop then may allow propagation of the zero values set in
the first loop to the reads of vp->dp->w[i] (IRAM7), eliminating the first loop altogether.
Loop tiling with a tile size of 16 may also eliminate the & 31 expressions for the
shift values. Since the new inner loop only runs from 0 to 16, the value range analysis now
finds that the & 31 expression is not limiting the value range any further.

All remaining input IRAMs are character (8 bit) based. So it may be required for
split networks to split the 32-bit stream into four 8-bit streams which are then merged. This
adds 3 shifts, 3 ands, and 3 merges for every character IRAM. The merges could be
eliminated when unrolling the loop body. However, unrolling may be limited to unrolling
twice due to ALU availability as well as due to the fact that IRAM6 is already 16 bit based.
Unrolling once requires a shift by 16 and an or to write 32 bits in every cycle.
Unrolling further cannot increase pipeline throughput any more. So the body is
only unrolled once, eliminating one layer of merges. This may yield two separate
pipelines that each handle two eight bit slices of the 32-bit value from the IRAM, serialized
by merges.
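
As a hedged illustration of the split network, the following C sketch separates one 32-bit IRAM word into four 8-bit lanes using three shifts and three ANDs, and joins them again. The function names and the choice of lane 0 as the least significant byte are assumptions made for this sketch; the actual byte order depends on the endianness of the system.

```c
#include <assert.h>
#include <stdint.h>

/* Split one 32-bit word into four 8-bit lanes: 3 shifts and 3 ANDs. */
void split_word(uint32_t w, uint8_t lane[4])
{
    assert(lane);
    lane[0] = (uint8_t)(w & 0xFFu);             /* AND only */
    lane[1] = (uint8_t)((w >> 8) & 0xFFu);      /* shift and AND */
    lane[2] = (uint8_t)((w >> 16) & 0xFFu);     /* shift and AND */
    lane[3] = (uint8_t)(w >> 24);               /* shift only */
}

/* Join four lanes back into one word (the inverse of split_word). */
uint32_t join_word(const uint8_t lane[4])
{
    assert(lane);
    return (uint32_t)lane[0]
         | ((uint32_t)lane[1] << 8)
         | ((uint32_t)lane[2] << 16)
         | ((uint32_t)lane[3] << 24);
}
```

In the sketch, `split_word(0x11223344, lane)` yields the lanes 0x44, 0x33, 0x22 and 0x11, and `join_word(lane)` restores the original word.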

The modified code may be approximately as follows (unrolling and splitting
omitted for simplicity):

char *iram0 = Branchtab29_1;            // XPPPreload(0, Branchtab29_1, 128/4);
char *iram2 = Branchtab29_2;            // XPPPreload(2, Branchtab29_2, 128/4);
char *iram4 = vp->old_metrics;          // XPPPreload(4, vp->old_metrics, 128/4);
char *iram5 = vp->old_metrics+128;      // XPPPreload(5, vp->old_metrics+128, 128/4);
short *iram6 = vp->new_metrics;         // XPPPreload(6, vp->new_metrics, 128/2);
unsigned long *iram7 = vp->dp->w;       // XPPPreload(7, vp->dp->w, 8);
// sym1 & sym2 are in IRAM 1 & 3

for (_i=0; _i<8; _i++) {
    rlse = 0;
    for (i2=0; i2<32; i2+=2) {
        unsigned char metric, _tmp, m0, m1, _m0, _m1;

        metric = ((*iram0++ ^ sym1) +
                  (*iram2++ ^ sym2) + 1)/2;
        _tmp = (metric << 1) - 15;
        m0 = *iram4++ + metric;
        m1 = *iram5++ + (15 - metric);
        _m0 = m0 - _tmp;
        _m1 = m1 + _tmp;

        *iram6++ = (min(m0, m1) << 8) | min(_m0, _m1);
        rlse = rlse | (m0 >= m1)   << i2
                    | (_m0 >= _m1) << (i2+1);
    }
    *iram7++ = rlse;
}
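
The effect of tiling by sixteen on the decision word can be sketched in isolation. The following C sketch is an illustration under stated assumptions: the decision bits are supplied as two input arrays, and the names `pack_decisions_naive` and `pack_decisions_tiled` are hypothetical. It shows that accumulating the bits in a temporary and storing once per word gives the same result as the read-modify-write on every iteration, and that the & 31 masks are no longer needed.

```c
#include <assert.h>
#include <stdint.h>

enum { STATES = 128, TILE = 16, WORDS = STATES / TILE };

/* Untiled: every iteration reads, modifies and rewrites w[i >> 4]. */
void pack_decisions_naive(const uint8_t d0[STATES], const uint8_t d1[STATES],
                          uint32_t w[WORDS])
{
    for (int i = 0; i < WORDS; i++)
        w[i] = 0;                                   /* separate initialization loop */
    for (int i = 0; i < STATES; i++)
        w[i >> 4] |= (uint32_t)(d0[i] & 1u) << ((2 * i) & 31)
                   | (uint32_t)(d1[i] & 1u) << ((2 * i + 1) & 31);
}

/* Tiled by 16: the bits are accumulated in a temporary and stored once. */
void pack_decisions_tiled(const uint8_t d0[STATES], const uint8_t d1[STATES],
                          uint32_t w[WORDS])
{
    assert(STATES % TILE == 0);
    for (int t = 0; t < WORDS; t++) {
        uint32_t acc = 0;                           /* replaces the zeroing loop */
        for (int i2 = 0; i2 < 2 * TILE; i2 += 2) {
            int i = t * TILE + i2 / 2;
            acc |= (uint32_t)(d0[i] & 1u) << i2     /* i2 < 32, so no mask needed */
                 | (uint32_t)(d1[i] & 1u) << (i2 + 1);
        }
        w[t] = acc;                                 /* single store per word */
    }
}
```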

The modified data flow graph (unrolling and splitting omitted for simplicity) is
shown in Fig. 39. The splitting network is shown in Fig. 40. The bottom
most level merge is omitted for each level of unrolling.



The following is a corresponding parameter table.

Parameter                  Value
Vector length              128
Reused data set size
I/O IRAMs                  6 I + 2 O
ALU                        2*24+4*3(split)+2(join) = 62
BREG                       few
FREG                       few
Data flow graph width      4
Data flow graph height     11+3 (split)
Configuration cycles       14+64

Re-Normalization

The normalization consists of a loop scanning the input for the minimum and a second loop
that subtracts the minimum from all elements. There is a data dependency between all
iterations of the first loop and all iterations of the second loop. Therefore, the two loops
cannot be merged or pipelined. They may be handled individually.



Minimum Search 

The third loop is a minimum search on a byte array.

char *iram0 = vp->new_metrics;   // XPPPreload(0, vp->new_metrics, 64/4);
for (i=0; i<64; i++)
    minmetric = min(minmetric, *iram0++);



The following is a corresponding parameter table. 



Parameter                  Value
Vector length              64
Reused data set size
I/O IRAMs                  1 I + 1 O
ALU                        1
BREG                       0
FREG                       0
Data flow graph width      1
Data flow graph height     1
Configuration cycles       64

Reduction recognition may eliminate the dependence for minmetric, enabling a
four-times unroll to utilize the IRAM width of 32 bits. A split network has to be added to
separate the 8 bit streams using 3 SHIFT and 3 AND operations. Tree balancing may
re-distribute the min() operations to minimize the tree height.

char *iram0 = vp->new_metrics;   // XPPPreload(0, vp->new_metrics, 16);
for (i=0; i<16; i++)
    minmetric = min(minmetric, min(min(*iram0++, *iram0++),
                                   min(*iram0++, *iram0++)));
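
The four-times unrolled minimum search can be sketched on packed 32-bit words as follows. This is a minimal sketch, assuming that four metrics are packed per word with the first metric in the least significant byte; the names `min_search_packed` and `min_u8` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

static uint8_t min_u8(uint8_t a, uint8_t b)
{
    return a < b ? a : b;
}

/* Minimum over n_words packed 32-bit words, four byte lanes per word. */
uint8_t min_search_packed(const uint32_t *words, int n_words)
{
    assert(words && n_words > 0);
    uint8_t minmetric = 255;
    for (int i = 0; i < n_words; i++) {
        uint32_t w = words[i];
        /* split network: 3 shifts and 3 ANDs */
        uint8_t b0 = (uint8_t)(w & 0xFFu);
        uint8_t b1 = (uint8_t)((w >> 8) & 0xFFu);
        uint8_t b2 = (uint8_t)((w >> 16) & 0xFFu);
        uint8_t b3 = (uint8_t)(w >> 24);
        /* balanced tree: two independent minima feed a third one,
           and only the last min() carries the loop dependence */
        minmetric = min_u8(minmetric, min_u8(min_u8(b0, b1), min_u8(b2, b3)));
    }
    return minmetric;
}
```

The balanced arrangement keeps the dependence height at three min() operations per word, whereas a linear chain of four min() operations would have a height of four.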



The following is a corresponding parameter table. 



Parameter                  Value
Vector length              16
Reused data set size
I/O IRAMs                  1 I + 1 O
ALU                        4*min
BREG                       3*shln+3*shm
FREG                       0
Data flow graph width      4
Data flow graph height     5
Configuration cycles       5+16

Reduction recognition again may eliminate the loop carried dependence for
minmetric, enabling loop tiling and then unroll and jam to increase parallelism. The
maximum for the tiling size is 16 IRAMs / 2 IRAMs = 8. Constant propagation and tree
rebalancing may reduce the dependence height of the final merging expression:

char *iram0 = vp->new_metrics;       // XPPPreload(0, vp->new_metrics, 2);
char *iram1 = vp->new_metrics+8;     // XPPPreload(1, vp->new_metrics+8, 2);
char *iram2 = vp->new_metrics+16;    // XPPPreload(2, vp->new_metrics+16, 2);
char *iram3 = vp->new_metrics+24;    // XPPPreload(3, vp->new_metrics+24, 2);
char *iram4 = vp->new_metrics+32;    // XPPPreload(4, vp->new_metrics+32, 2);
char *iram5 = vp->new_metrics+40;    // XPPPreload(5, vp->new_metrics+40, 2);
char *iram6 = vp->new_metrics+48;    // XPPPreload(6, vp->new_metrics+48, 2);
char *iram7 = vp->new_metrics+56;    // XPPPreload(7, vp->new_metrics+56, 2);

for (i=0; i<2; i++) {
    minmetric0 = min(minmetric0, min(min(*iram0++, *iram0++),
                                     min(*iram0++, *iram0++)));
    minmetric1 = min(minmetric1, min(min(*iram1++, *iram1++),
                                     min(*iram1++, *iram1++)));
    minmetric2 = min(minmetric2, min(min(*iram2++, *iram2++),
                                     min(*iram2++, *iram2++)));
    minmetric3 = min(minmetric3, min(min(*iram3++, *iram3++),
                                     min(*iram3++, *iram3++)));
    minmetric4 = min(minmetric4, min(min(*iram4++, *iram4++),
                                     min(*iram4++, *iram4++)));
    minmetric5 = min(minmetric5, min(min(*iram5++, *iram5++),
                                     min(*iram5++, *iram5++)));
    minmetric6 = min(minmetric6, min(min(*iram6++, *iram6++),
                                     min(*iram6++, *iram6++)));
    minmetric7 = min(minmetric7, min(min(*iram7++, *iram7++),
                                     min(*iram7++, *iram7++)));
}
minmetric = min(min(min(minmetric0, minmetric1),
                    min(minmetric2, minmetric3)),
                min(min(minmetric4, minmetric5),
                    min(minmetric6, minmetric7)));



The following is a corresponding parameter table.

Parameter                  Value
Vector length              2
Reused data set size
I/O IRAMs                  8 I + 1 O
ALU                        8*4*min = 32
BREG                       8*(3*shln+3*shm) = 48
FREG                       0
Data flow graph width      8*4 = 32
Data flow graph height     5
Configuration cycles       8+2

Re-Normalization 

The fourth loop subtracts the minimum of the third loop from each element in the array. The
read-modify-write operation has to be broken up into two IRAMs. Otherwise, the IRAM
ports will limit throughput.

char *iram0 = vp->new_metrics;   // XPPPreload(0, vp->new_metrics, 64/4)
char *iram1 = vp->new_metrics;   // XPPPreloadClean(1, vp->new_metrics, 64/4)
for (i=0; i<64; i++)
    *iram1++ = *iram0++ - minmetric;



The following is a corresponding parameter table. 



Parameter                  Value
Vector length              64
Reused data set size
I/O IRAMs                  2 I + 1 O
ALU                        1
BREG
FREG
Data flow graph width
Data flow graph height
Configuration cycles       64

There are no loop carried dependencies. Since the data size is bytes, the inner loop can be
unrolled four times without exceeding the IRAM bandwidth requirements. Networks
splitting the 32-bit stream into four 8-bit streams and rejoining the individual results to a
common 32-bit result stream are inserted.

char *iram0 = vp->new_metrics;   // XPPPreload(0, vp->new_metrics, 16)
char *iram1 = vp->new_metrics;   // XPPPreloadClean(1, vp->new_metrics, 16)
for (i=0; i<16; i++) {
    *iram1++ = *iram0++ - minmetric;
    *iram1++ = *iram0++ - minmetric;
    *iram1++ = *iram0++ - minmetric;
    *iram1++ = *iram0++ - minmetric;
}
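
The split, subtract and join steps for one 32-bit word can be sketched in C as follows. The names `renorm_word` and `renormalize_packed` and the separate input and output buffers (mirroring the read IRAM and the write IRAM) are assumptions made for this sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Subtract minmetric from each of the four 8-bit lanes of one word. */
uint32_t renorm_word(uint32_t w, uint8_t minmetric)
{
    uint32_t out = 0;
    for (int lane = 0; lane < 4; lane++) {
        uint8_t v = (uint8_t)((w >> (8 * lane)) & 0xFFu);   /* split */
        uint8_t r = (uint8_t)(v - minmetric);               /* wraps modulo 256 like a byte store */
        out |= (uint32_t)r << (8 * lane);                   /* join */
    }
    return out;
}

/* Read packed metrics from one buffer and write the result to another. */
void renormalize_packed(const uint32_t *in, uint32_t *out, int n_words,
                        uint8_t minmetric)
{
    assert(in && out && n_words >= 0);
    for (int i = 0; i < n_words; i++)
        out[i] = renorm_word(in[i], minmetric);
}
```

For example, `renorm_word(0x0A141E28, 10)` subtracts 10 from the lanes 0x28, 0x1E, 0x14 and 0x0A and returns 0x000A141E.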



The following is a corresponding parameter table. 



Parameter                 Value
Vector length             16
Reused data set size
I/O IRAMs                 2 I + 1 O
ALU                       4*4(sub) = 16
BREG                      6*shln+6*shm = 12
FREG                      0
Data flow graph width     4
Data flow graph height    5
Configuration cycles      2(split) + 4*1(sub) + 2(join) = 8
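
The split and join networks mentioned above can be pictured in plain C. The following is only an illustrative sketch, not XPP or NML code: the function name sub_packed4 and the use of uint32_t words are assumptions made for this example. It splits one 32-bit word into four 8-bit values, subtracts minmetric from each with 8-bit wrap-around, and rejoins the results into one 32-bit word.

```c
#include <stdint.h>

/* Hypothetical helper: subtract minmetric from each of the four bytes
 * packed in one 32-bit word, mimicking split -> 4 x sub -> join. */
uint32_t sub_packed4(uint32_t word, uint8_t minmetric)
{
    uint32_t result = 0;
    int lane;

    for (lane = 0; lane < 4; lane++) {
        /* split: extract one element of the 8-bit stream */
        uint8_t value = (uint8_t)(word >> (8 * lane));
        /* sub: 8-bit subtraction, wraps modulo 256 like unsigned char */
        uint8_t diff = (uint8_t)(value - minmetric);
        /* join: place the result back into its lane */
        result |= (uint32_t)diff << (8 * lane);
    }
    return result;
}
```

Each lane is independent of the others, which is why the four subtractions can be mapped to four parallel ALUs between the split and join stages.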



Unroll and jam can be applied after loop tiling, in analogy to the third loop, but loop tiling is
now limited by the BREGs used by the split and join networks. The computed tiling size
(unroll factor) is 64 BREGs / 12 BREGs = 5 (rounded down), which is replaced by 4, since the same
throughput is achieved with less overhead.

char *iram0 = vp->new_metrics;      // XPPPreload(0, vp->new_metrics, 4)
char *iram1 = vp->new_metrics;      // XPPPreloadClean(1, vp->new_metrics, 4)
char *iram2 = vp->new_metrics+16;   // XPPPreload(2, vp->new_metrics+16, 4)
char *iram3 = vp->new_metrics+16;   // XPPPreloadClean(3, vp->new_metrics+16, 4)
char *iram4 = vp->new_metrics+32;   // XPPPreload(4, vp->new_metrics+32, 4)
char *iram5 = vp->new_metrics+32;   // XPPPreloadClean(5, vp->new_metrics+32, 4)
char *iram6 = vp->new_metrics+48;   // XPPPreload(6, vp->new_metrics+48, 4)
char *iram7 = vp->new_metrics+48;   // XPPPreloadClean(7, vp->new_metrics+48, 4)

for (i = 0; i < 4; i++) {
    *iram1++ = *iram0++ - minmetric;   // first pipeline
    *iram1++ = *iram0++ - minmetric;
    *iram1++ = *iram0++ - minmetric;
    *iram1++ = *iram0++ - minmetric;
    *iram3++ = *iram2++ - minmetric;   // second pipeline
    *iram3++ = *iram2++ - minmetric;
    *iram3++ = *iram2++ - minmetric;
    *iram3++ = *iram2++ - minmetric;
    *iram5++ = *iram4++ - minmetric;   // third pipeline
    *iram5++ = *iram4++ - minmetric;
    *iram5++ = *iram4++ - minmetric;
    *iram5++ = *iram4++ - minmetric;
    *iram7++ = *iram6++ - minmetric;   // fourth pipeline
    *iram7++ = *iram6++ - minmetric;
    *iram7++ = *iram6++ - minmetric;
    *iram7++ = *iram6++ - minmetric;
}



The following is a corresponding parameter table. 



Parameter                 Value
Vector length             4
Reused data set size
I/O IRAMs                 5 I + 4 O
ALU                       4*(6(split) + 4(sub) + 6(join)) = 64
BREG                      4*(6*shln+6*shm) = 48
FREG                      0
Data flow graph width     16
Data flow graph height    1
Configuration cycles      2(split) + 4*1(sub) + 2(join) = 8
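
The effect of tiling the re-normalization loop into four independent pipelines can be checked in plain C. The sketch below is an illustration only: the function names and the TILES and TILE_SIZE constants are assumptions of this example, and ordinary array indexing stands in for the IRAM pointers. It compares the original single loop with a tiled variant in which each tile has its own input and output range.

```c
#include <stddef.h>

#define METRICS 64
#define TILES 4
#define TILE_SIZE (METRICS / TILES)

/* Reference: the original single loop over all 64 metrics. */
void renormalize_ref(unsigned char *out, const unsigned char *in,
                     unsigned char minmetric)
{
    size_t i;
    for (i = 0; i < METRICS; i++)
        out[i] = (unsigned char)(in[i] - minmetric);
}

/* Tiled variant: each tile models one independent pipeline with its own
 * input and output range, like the IRAM pairs 0/1, 2/3, 4/5 and 6/7. */
void renormalize_tiled(unsigned char *out, const unsigned char *in,
                       unsigned char minmetric)
{
    size_t i, t;
    for (i = 0; i < TILE_SIZE; i++)
        for (t = 0; t < TILES; t++)   /* the four pipelines */
            out[t * TILE_SIZE + i] =
                (unsigned char)(in[t * TILE_SIZE + i] - minmetric);
}

/* Returns 1 if both variants agree on a simple ramp input. */
int renormalize_variants_agree(void)
{
    unsigned char in[METRICS], a[METRICS], b[METRICS];
    size_t i;

    for (i = 0; i < METRICS; i++)
        in[i] = (unsigned char)(100 + i);
    renormalize_ref(a, in, 100);
    renormalize_tiled(b, in, 100);
    for (i = 0; i < METRICS; i++)
        if (a[i] != b[i] || a[i] != i)
            return 0;
    return 1;
}
```

Because the loop has no loop carried dependencies, the order in which the tiles are visited does not change the result.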



Final Code

Finally, the following code may be obtained:

int update_viterbi29(void *p, unsigned char sym1, unsigned char sym2) {
    int i;
    struct v29 *vp = p;
    unsigned char *tmp;
    int normalize = 0;

    // initialization loop eliminated
    // for (i=0; i<8; i++)
    //     vp->dp->w[i] = 0;

    // Configuration for butterfly loop
    char *iram0 = Branchtab29_1;          // XPPPreload(0, Branchtab29_1, 128/4);
    char *iram2 = Branchtab29_2;          // XPPPreload(2, Branchtab29_2, 128/4);
    char *iram4 = vp->old_metrics;        // XPPPreload(4, vp->old_metrics, 128/4);
    char *iram5 = vp->old_metrics+128;    // XPPPreload(5, vp->old_metrics+128, 128/4);
    short *iram6 = vp->new_metrics;       // XPPPreload(6, vp->new_metrics, 128/2);
    unsigned long *iram7 = vp->dp->w;     // XPPPreload(7, vp->dp->w, 8);
    // sym1 & sym2 are in IRAM 1 & 3

    for (_i = 0; _i < 8; _i++) {
        rlse = 0;
        for (i2 = 0; i2 < 32; i2 += 2) {  // unrolled once
            unsigned char metric, _tmp, m0, m1, _m0, _m1;

            metric = ((*iram0++ ^ sym1) + (*iram1++ ^ sym2) + 1) / 2;
            _tmp = (metric << 1) - 15;

            m0 = *iram2++ + metric;
            m1 = *iram3++ + (15 - metric);
            _m0 = m0 - _tmp;
            _m1 = m1 + _tmp;

            *iram6++ = (min(m0, m1) << 8) | min(_m0, _m1);
            rlse = rlse | (m0 >= m1) << i2
                        | (_m0 >= _m1) << (i2+1);
        }
        *iram7++ = rlse;
    }

    /* Renormalize metrics */
    if (vp->new_metrics[0] > 150) {
        int i;

        // Configuration for loop 3
        char *iram0 = vp->new_metrics;      // XPPPreload(0, vp->new_metrics, 8);
        char *iram1 = vp->new_metrics+8;    // XPPPreload(1, vp->new_metrics+8, 8);
        char *iram2 = vp->new_metrics+16;   // XPPPreload(2, vp->new_metrics+16, 8);
        char *iram3 = vp->new_metrics+24;   // XPPPreload(3, vp->new_metrics+24, 8);
        char *iram4 = vp->new_metrics+32;   // XPPPreload(4, vp->new_metrics+32, 8);
        char *iram5 = vp->new_metrics+40;   // XPPPreload(5, vp->new_metrics+40, 8);
        char *iram6 = vp->new_metrics+48;   // XPPPreload(6, vp->new_metrics+48, 8);
        char *iram7 = vp->new_metrics+56;   // XPPPreload(7, vp->new_metrics+56, 8);

        for (_i = 0; _i < 2; _i++) {
            minmetric0 = min(minmetric0, min(min(*iram0++, *iram0++),
                                             min(*iram0++, *iram0++)));
            minmetric1 = min(minmetric1, min(min(*iram1++, *iram1++),
                                             min(*iram1++, *iram1++)));
            minmetric2 = min(minmetric2, min(min(*iram2++, *iram2++),
                                             min(*iram2++, *iram2++)));
            minmetric3 = min(minmetric3, min(min(*iram3++, *iram3++),
                                             min(*iram3++, *iram3++)));
            minmetric4 = min(minmetric4, min(min(*iram4++, *iram4++),
                                             min(*iram4++, *iram4++)));
            minmetric5 = min(minmetric5, min(min(*iram5++, *iram5++),
                                             min(*iram5++, *iram5++)));
            minmetric6 = min(minmetric6, min(min(*iram6++, *iram6++),
                                             min(*iram6++, *iram6++)));
            minmetric7 = min(minmetric7, min(min(*iram7++, *iram7++),
                                             min(*iram7++, *iram7++)));
        }

        minmetric = min(min(min(minmetric_0, minmetric_1),
                            min(minmetric_2, minmetric_3)),
                        min(min(minmetric_4, minmetric_5),
                            min(minmetric_6, minmetric_7)));
        // minmetric is written to the output IRAM

        // Configuration for loop 4, minmetric is in an input IRAM
        char *iram0 = vp->new_metrics;      // XPPPreload(0, vp->new_metrics, 4)
        char *iram1 = vp->new_metrics;      // XPPPreloadClean(1, vp->new_metrics, 4)
        char *iram2 = vp->new_metrics+16;   // XPPPreload(2, vp->new_metrics+16, 4)
        char *iram3 = vp->new_metrics+16;   // XPPPreloadClean(3, vp->new_metrics+16, 4)
        char *iram4 = vp->new_metrics+32;   // XPPPreload(4, vp->new_metrics+32, 4)
        char *iram5 = vp->new_metrics+32;   // XPPPreloadClean(5, vp->new_metrics+32, 4)
        char *iram6 = vp->new_metrics+48;   // XPPPreload(6, vp->new_metrics+48, 4)
        char *iram7 = vp->new_metrics+48;   // XPPPreloadClean(7, vp->new_metrics+48, 4)

        for (i = 0; i < 4; i++) {
            *iram1++ = *iram0++ - minmetric;   // first pipeline
            *iram1++ = *iram0++ - minmetric;
            *iram1++ = *iram0++ - minmetric;
            *iram1++ = *iram0++ - minmetric;
            *iram3++ = *iram2++ - minmetric;   // second pipeline
            *iram3++ = *iram2++ - minmetric;
            *iram3++ = *iram2++ - minmetric;
            *iram3++ = *iram2++ - minmetric;
            *iram5++ = *iram4++ - minmetric;   // third pipeline
            *iram5++ = *iram4++ - minmetric;
            *iram5++ = *iram4++ - minmetric;
            *iram5++ = *iram4++ - minmetric;
            *iram7++ = *iram6++ - minmetric;   // fourth pipeline
            *iram7++ = *iram6++ - minmetric;
            *iram7++ = *iram6++ - minmetric;
            *iram7++ = *iram6++ - minmetric;
        }
        normalize = minmetric;
    }

    vp->dp++;
    tmp = vp->old_metrics;
    vp->old_metrics = vp->new_metrics;
    vp->new_metrics = tmp;

    return normalize;
}
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Performance Considerations 

In this example there is not a high data locality. Every input data item is read
exactly once. Only in the case of re-normalization is the new_metric array re-read
and re-written. To fully utilize the PAE array, loop tiling was used in conjunction with
reduction recognition to break dependencies using algebraic identities. In some cases
(minimum search) this may lead to extremely short vector lengths. This is not a
problem as it still does reduce the running time of the configuration and the transfer time
from the top of the memory hierarchy to the IRAMs stays the same. The vector length
can be increased if the outer loop that calls the function is known. The
additional data can be used to increase the fill grade of the IRAMs by unrolling the
outer loop, keeping the vector length longer. This would further increase configuration
performance by reducing overall pipeline setup times.
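
The way reduction recognition breaks the dependency chain of the minimum search can be sketched in plain C. This is an illustration under stated assumptions, not the generated configuration: the function names and the PARTS constant are hypothetical, and the array length is assumed to be a multiple of PARTS. The algebraic identity used is that min is associative and commutative, so partial minima over disjoint tiles can be combined at the end.

```c
#include <stddef.h>

static unsigned char min_u8(unsigned char a, unsigned char b)
{
    return a < b ? a : b;
}

/* Sequential reduction: every iteration depends on the previous one. */
unsigned char min_sequential(const unsigned char *a, size_t n)
{
    unsigned char m = 255;
    size_t i;
    for (i = 0; i < n; i++)
        m = min_u8(m, a[i]);
    return m;
}

#define PARTS 8

/* Tiled reduction: PARTS independent partial minima, combined at the end.
 * n must be a multiple of PARTS in this sketch. */
unsigned char min_tiled(const unsigned char *a, size_t n)
{
    unsigned char part[PARTS];
    unsigned char m = 255;
    size_t tile = n / PARTS, p, i;

    for (p = 0; p < PARTS; p++) {
        part[p] = 255;
        for (i = 0; i < tile; i++)
            part[p] = min_u8(part[p], a[p * tile + i]);
    }
    for (p = 0; p < PARTS; p++)
        m = min_u8(m, part[p]);
    return m;
}

/* Returns 1 if both reductions agree on a 64-element test array. */
int min_variants_agree(void)
{
    unsigned char a[64];
    size_t i;
    for (i = 0; i < 64; i++)
        a[i] = (unsigned char)(200 - (i * 37) % 101);
    return min_sequential(a, 64) == min_tiled(a, 64);
}
```

With 64 elements and eight partial minima, each partial reduction runs over only eight elements, which corresponds to the very short vector lengths noted above.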



Performance of XPP for this example is compared to a hypothetical superscalar RISC-
architecture. An average issue width of two is assumed, which means that the
RISC on average executes two operations in parallel. The estimate is achieved by counting
instructions for the source code presented under the heading "Interprocedural
Optimizations and Scalar Transformations." See the table below.

[Table: RISC instruction estimate. Columns: Bfly Setup, Butterfly, Min Setup, Min Search,
Norm Setup, Normalize. Rows: ADRCOMP, LD/ST, LDI, MOVE, BITOP, ADD/SUB, MULT, CMP,
Cycles, Count, Issue Width, Total Cycles, Est. RISC cycles. Values legible in the source
include total cycle figures of 4480 and 352 and a final row reading 2, 320, 6168 RISC cycles.]



MPEG2 encoder/decoder

Quantization / Inverse Quantization (quant.c)

The quantization file may include routines for quantization and inverse quantization
of 8x8 macro blocks. These functions may differ for intra and non-intra blocks.
Furthermore, the encoder may distinguish between MPEG1 and
MPEG2 inverse quantization.

This may give a total of 6 functions, which are all candidates for function inlining, since
they do not use the XPP capacity by far.

Since all functions may have the same layout (some checks, one main loop running over the
macro block quantizing with a quantization matrix), focus is placed on
"iquant_intra," the inverse quantization of intra-blocks, since it may include all
elements found in the other procedures. (The non_intra quantization loop bodies are more
complicated, but add no compiler complexity.) In the source code the mpeg1 part is already
inlined, which is straightforward since the function is statically defined and includes
no function calls itself. Therefore, the compiler may inline it and dead function
elimination may remove the whole definition.

Original Code 

void iquant_intra(src, dst, dc_prec, quant_mat, mquant)
short *src, *dst;
int dc_prec;
unsigned char *quant_mat;
int mquant;
{
  int i, val, sum;

  if (mpeg1) {
    dst[0] = src[0] << (3-dc_prec);
    for (i=1; i<64; i++)
    {
      val = (int)(src[i]*quant_mat[i]*mquant)/16;

      /* mismatch control */
      if ((val&1)==0 && val!=0)
        val += (val>0) ? -1 : 1;

      /* saturation */
      dst[i] = (val>2047) ? 2047 : ((val<-2048) ? -2048 : val);
    }
  }
  else
  {
    sum = dst[0] = src[0] << (3-dc_prec);
    for (i=1; i<64; i++)
    {
      val = (int)(src[i]*quant_mat[i]*mquant)/16;
      sum += dst[i] = (val>2047) ? 2047 : ((val<-2048) ? -2048 : val);
    }

    /* mismatch control */
    if ((sum&1)==0)
      dst[63] ^= 1;
  }
}

Interprocedural Optimizations 

Analyzing the loop bodies, it can be seen that they may easily fit to the XPP
and do not use the maximum of resources by far. The function is called three times from
module putseq.c. With inter-module function inlining, the code for the function call
may disappear and may be replaced with the function. Therefore, it may be
as follows:

for (k=0; k<mb_height*mb_width; k++) {
  if (mbinfo[k].mb_type & MB_INTRA)
    for (j=0; j<block_count; j++)
      if (mpeg1) {
        blocks[k*block_count+j][0] = blocks[k*block_count+j][0] << (3-dc_prec);
        for (i=1; i<64; i++) {
          val = (int)(blocks[k*block_count+j][i]*intra_q[i]*mquant)/16;
          ...
        }
      } else {
        sum = blocks[k*block_count+j][0] = blocks[k*block_count+j][0] << (3-dc_prec);
        for (i=1; i<64; i++) {
          val = (int)(blocks[k*block_count+j][i]*intra_q[i]*mquant)/16;
          ...
        }
        ...
      }
  else {
    ...
  }
}



Basic transformations 

Since the global mpeg1 does not change within the loop, unswitching may move the
control statement outside the j loop and may produce two loop nests.

for (k=0; k<mb_height*mb_width; k++) {
  if (mbinfo[k].mb_type & MB_INTRA)
    if (mpeg1)
      for (j=0; j<block_count; j++) {
        blocks[k*block_count+j][0] = blocks[k*block_count+j][0] << (3-dc_prec);
        for (i=1; i<64; i++) {
          val = (int)(blocks[k*block_count+j][i]*intra_q[i]*mquant)/16;
          ...
        }
      }
    else
      for (j=0; j<block_count; j++) {
        sum = blocks[k*block_count+j][0] = blocks[k*block_count+j][0] << (3-dc_prec);
        for (i=1; i<64; i++) {
          val = (int)(blocks[k*block_count+j][i]*intra_q[i]*mquant)/16;
        }
      }
}

Furthermore, the following transformations may be performed:

■ A peephole optimization may reduce the divide by 16 to a right shift by 4. This
may be essential since loop bodies including division are not considered
for the XPP.

■ Idiom recognition may reduce the statement after the "saturation" comment to
dst[i] = min(max(val, -2048), 2047).
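
Both transformations can be checked with a few lines of plain C. The sketch below is illustrative only and the function names are assumptions of this example. Note that in C an integer division truncates toward zero, while an arithmetic right shift rounds toward negative infinity, and right-shifting a negative signed value is implementation-defined; the shift helper is therefore only exercised with non-negative operands here.

```c
static int min_int(int a, int b) { return a < b ? a : b; }
static int max_int(int a, int b) { return a > b ? a : b; }

/* Saturation as written in the original source (nested conditionals). */
int saturate_ternary(int val)
{
    return (val > 2047) ? 2047 : ((val < -2048) ? -2048 : val);
}

/* Saturation after idiom recognition. */
int saturate_minmax(int val)
{
    return min_int(max_int(val, -2048), 2047);
}

/* Divide by 16 expressed as a right shift by 4 (non-negative val only). */
int div16_shift(int val)
{
    return val >> 4;
}
```

For every input the two saturation forms return the same value, so the replacement changes only how the operation is expressed, not its result.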

Increasing parallelism 

It may be desired to increase parallelism. The j-i loop nest is a candidate for
unroll-and-jam when the interprocedural value range analysis finds that block_count can
only get the values 6, 8, or 12. Therefore, it has a value range [6,12] with the additional
attribute to be dividable by 2. Thus, an unroll-and-jam with the factor 2 is applicable (the
resource constraints would choose a greater value). Since no loop carried
dependencies exist, this transformation is safe.

This is to say that the source code contains a manually peeled first iteration. This peeling
has been done because the value calculated for the first block value is completely different
from the other iterations, and the control statement in the loop would cause a major
performance decrease on traditional processors. Although this does not prevent unroll-and-jam
(because there are no dependencies between the peeled-off first iteration and the rest of
the loop), the transformation must be prepared to handle such cases.



After unroll-and-jam, the source code may be approximately as follows (only one
of the nests shown and the peeled first iterations moved in front):

for (j=0; j<block_count; j+=2) {
  blocks[k*count+j][0] = blocks[k*count+j][0] << (3-dc_prec);
  blocks[k*count+j+1][0] = blocks[k*count+j+1][0] << (3-dc_prec);
  for (i=1; i<64; i++) {
    val = (int)(blocks[k*count+j][i]*intra_q[i]*mbinfo[k].mquant)>>4;

    /* mismatch control */
    if ((val&1)==0 && val!=0)
      val += (val>0) ? -1 : 1;

    /* saturation */
    blocks[k*count+j][i] = min(max(val, -2048), 2047);

    val = (int)(blocks[k*count+j+1][i]*intra_q[i]*mbinfo[k].mquant)>>4;

    /* mismatch control */
    if ((val&1)==0 && val!=0)
      val += (val>0) ? -1 : 1;

    /* saturation */
    blocks[k*count+j+1][i] = min(max(val, -2048), 2047);
  }
}



Further parallelism can be obtained by index set splitting. Normally used to break
dependence cycles in the DDG, it can here be used to split the i-loop in two and let two
sub-configurations (sub-configuration is chosen as a working title for
configurations that include independent networks that do not interfere) work on distinct
blocks of data. Thus, the i loop is split into 2 or more loops which work on different subsets
of the data at the same time.
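
Index set splitting can be illustrated with a small plain C example. This is a sketch under assumptions, not code from the MPEG2 sources: the function names and the simplified loop body (a multiply followed by a shift on non-negative data) are hypothetical. The two loops of the split variant touch disjoint index ranges, which is what allows two independent networks to process them at the same time.

```c
#define N 64

/* Original form: one loop over i = 1 .. 63. */
void scale_single(short *blk, int q)
{
    int i;
    for (i = 1; i < N; i++)
        blk[i] = (short)((blk[i] * q) >> 4);
}

/* After index set splitting: two loops over disjoint halves of the
 * index set. Neither loop reads or writes data of the other. */
void scale_split(short *blk, int q)
{
    int i;
    for (i = 1; i < N / 2; i++)        /* first sub-configuration */
        blk[i] = (short)((blk[i] * q) >> 4);
    for (i = N / 2; i < N; i++)        /* second sub-configuration */
        blk[i] = (short)((blk[i] * q) >> 4);
}

/* Returns 1 if both variants produce identical blocks. */
int split_variants_agree(void)
{
    short a[N], b[N];
    int i;
    for (i = 0; i < N; i++)
        a[i] = b[i] = (short)(i * 3);
    scale_single(a, 8);
    scale_split(b, 8);
    for (i = 0; i < N; i++)
        if (a[i] != b[i])
            return 0;
    return 1;
}
```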



Handling the data types

In contrast to the FIR-Filter, edge detector and matrix multiplication benchmarks, which all
use data types fitting perfectly to the XPP (it is assumed that the size of int is chosen to
be the XPP architecture data bit width, as everything else would not lead to any feasible
results), the MPEG2 codec uses all data types commonly used on a processor for desktop
applications. Written for the Intel x86 and comparable architectures, it may
be assumed that the sizes of char, short, and int are 8, 16, and 32 respectively. Assuming that
the XPP has a bit width of 32, precautions should be taken for the smaller data
types.

Therefore, the stream of data packets with each packet including 2 or 4
values of the shorter data type may be split into 2 or 4 streams. If enough resources
are left, this will cause no performance penalty. Each of the divided streams may be sent to
its own calculation network. Therefore, in every cycle, two short or four char
values may be handled. Nevertheless, this may cause an area penalty, because,
besides the split-merge elements, the whole data flow graph has to be duplicated as often as
needed. Fig. 41 shows how short values are handled. It shows the splitting of short
values into two streams and the merging of the streams after the calculation. The packet is
split into its hi- and lo part by shift operations and merged behind the calculation branches.
The legality of this transformation is the same as with loop unrolling, with an unrolling factor
as large as the factor by which the data type is smaller than the architecture data type.

This, however, is not the end of the matter. It may be further required for
the compiler to ensure that every intermediate result which produces an
over/under-flow for the shorter data type does the same with the bigger data type. Therefore,
it has to insert clipping operations which ensure that the network calculates with real 16
or 8 bit values, respectively.
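
The split, clip, and merge steps can be sketched in plain C. The following is an illustration under assumptions and not NML code: all function names are hypothetical, the example network simply adds a constant to both halves, and the clipping operation is modeled as a two's complement wrap to 16 bits so that an overflow on the 32-bit data path behaves as it would on a real 16-bit value.

```c
#include <stdint.h>

/* Split a 32-bit packet into its high and low 16-bit halves. */
uint16_t split_hi(uint32_t packet) { return (uint16_t)(packet >> 16); }
uint16_t split_lo(uint32_t packet) { return (uint16_t)(packet & 0xFFFFu); }

/* Merge two 16-bit halves back into one 32-bit packet. */
uint32_t merge(uint16_t hi, uint16_t lo)
{
    return ((uint32_t)hi << 16) | lo;
}

/* Reduce a 32-bit intermediate result to the value a real signed 16-bit
 * data path would hold (two's complement wrap-around). */
int32_t wrap16(int32_t v)
{
    uint32_t u = (uint32_t)v & 0xFFFFu;
    return (u & 0x8000u) ? (int32_t)u - 0x10000 : (int32_t)u;
}

/* Example network: add a constant to both halves of a packet. */
uint32_t add_packed_shorts(uint32_t packet, int16_t c)
{
    int32_t hi = wrap16(wrap16(split_hi(packet)) + c);
    int32_t lo = wrap16(wrap16(split_lo(packet)) + c);
    return merge((uint16_t)hi, (uint16_t)lo);
}
```

Without the wrap16 step, the high half in the last test below would hold 32768 on a 32-bit data path, a value no signed 16-bit variable can represent.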

Fig. 41: Splitting short values into two streams and merging them
after the calculation. This method causes no performance penalty.

If the configuration size does not allow the whole loop body to be duplicated or dependencies
prevent this, there is still a possibility of merging the split values again.
This of course causes a performance penalty compared to the previous solution, because the throughput
is only one (short) value/cycle. Fig. 42 shows how the merge is done. Instead
of streaming parallel through two networks, the values are serialized and de-serialized again
after the network. The split values are merged before the network. An
event generator drives the merge and demux PAEs. Fig. 42
replaces the two boxes labeled "network" in Fig. 41.

Inverse Discrete Cosine Transformation (idct.c)

The idct-algorithm may be used for the MPEG2 video decompression algorithm. It
operates on 8x8 blocks of video images in their frequency representation and transforms them
back into their original signal form. The MPEG2 decoder contains a transform-function that
calls idct for all blocks of a frequency-transformed picture to restore the original image.

The idct function may include two for-loops. The first loop calls idctrow, and the
second calls idctcol. Function inlining is able to eliminate the function calls within the entire
loop nest structure so that the numeric code is not interrupted by function calls anymore.
In another embodiment, a way to get rid of function calls between the loop nest is
loop embedding that pushes loops from the caller into the callee.
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Original Code (idct.c) 

/* two dimensional inverse discrete cosine transform */
void idct(block)
short *block;
{
  int i;

  for (i=0; i<8; i++)
    idctrow(block+8*i);

  for (i=0; i<8; i++)
    idctcol(block+i);
}



The first loop may change the values of the block row by row. Afterwards, the
changed block is further transformed column by column. In this embodiment, all rows
have to be finished before any column processing can be started. The function is illustrated
in Fig. 43.

Dependency analysis may detect true data dependencies between row processing and
column processing. Therefore, it may be required for the processing of the columns to be
delayed until all rows are done. The innermost loop bodies idctrow and idctcol are nearly
identical. They process numeric calculations on eight input values (column values in case of
idctcol and row values in case of idctrow). Eight output values are calculated and written
back (as column/row). Idctcol additionally applies clipping before the values are written
back. Accordingly, idctcol is presented herein. The
code may be as follows:

/* column (vertical) IDCT
 *
 *             7                         pi         1
 * dst[8*k] = sum c[l] * src[8*l] * cos( -- * ( k + - ) * l )
 *            l=0                        8          2
 *
 * where: c[0]    = 1/1024
 *        c[1..7] = (1/1024)*sqrt(2)
 */
static void idctcol(blk)
short *blk;
{
  int x0, x1, x2, x3, x4, x5, x6, x7, x8;

  /* shortcut */
  if (!((x1 = (blk[8*4]<<8)) | (x2 = blk[8*6]) |
        (x3 = blk[8*2]) | (x4 = blk[8*1]) | (x5 = blk[8*7]) |
        (x6 = blk[8*5]) | (x7 = blk[8*3])))
  {
    blk[8*0]=blk[8*1]=blk[8*2]=blk[8*3]=blk[8*4]=blk[8*5]=
      blk[8*6]=blk[8*7]=iclp[(blk[8*0]+32)>>6];
    return;
  }

  x0 = (blk[8*0]<<8) + 8192;

  /* first stage */
  x8 = W7*(x4+x5) + 4;
  x4 = (x8+(W1-W7)*x4)>>3;
  x5 = (x8-(W1+W7)*x5)>>3;
  x8 = W3*(x6+x7) + 4;
  x6 = (x8-(W3-W5)*x6)>>3;
  x7 = (x8-(W3+W5)*x7)>>3;

  /* second stage */
  x8 = x0 + x1;
  x0 -= x1;
  x1 = W6*(x3+x2) + 4;
  x2 = (x1-(W2+W6)*x2)>>3;
  x3 = (x1+(W2-W6)*x3)>>3;
  x1 = x4 + x6;
  x4 -= x6;
  x6 = x5 + x7;
  x5 -= x7;

  /* third stage */
  x7 = x8 + x3;
  x8 -= x3;
  x3 = x0 + x2;
  x0 -= x2;
  x2 = (181*(x4+x5)+128)>>8;
  x4 = (181*(x4-x5)+128)>>8;

  /* fourth stage */
  blk[8*0] = iclp[(x7+x1)>>14];
  blk[8*1] = iclp[(x3+x2)>>14];
  blk[8*2] = iclp[(x0+x4)>>14];
  blk[8*3] = iclp[(x8+x6)>>14];
  blk[8*4] = iclp[(x8-x6)>>14];
  blk[8*5] = iclp[(x0-x4)>>14];
  blk[8*6] = iclp[(x3-x2)>>14];
  blk[8*7] = iclp[(x7-x1)>>14];
}



W1 - W7 are macros for numeric constants that are substituted by the preprocessor. The iclp
array is used for clipping the results to 8-bit values. It is fully defined by the init_idct
function before idct is called the first time:

void init_idct()
{
  int i;

  iclp = iclip+512;
  for (i= -512; i<512; i++)
    iclp[i] = (i<-256) ? -256 : ((i>255) ? 255 : i);
}

A special kind of idiom recognition (function recognition) is able to replace the calculation of
each array element by a compiler known function that can be realized efficiently on the XPP.
If the compiler features whole program memory aliasing analysis, it is able to replace all uses
of the iclp array with the call of the compiler known function. Alternatively, a developer can
replace the iclp array accesses manually by the compiler known saturation function calls.
Fig. 44 shows a possible implementation for saturate(val,n) as an NML
schematic using two ALUs. In this case, it is necessary to replace array accesses like iclp[i]
with saturate(i,256).
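
The equivalence between the table lookup and the saturation function can be checked in plain C. The sketch below is an assumption-laden illustration: the C body of saturate is a hypothetical model that clamps to the range [-n, n-1], chosen so that saturate(i,256) reproduces the table built by init_idct; the actual compiler known function is realized in NML as described above.

```c
/* Hypothetical C model of saturate(val, n): clamp val to [-n, n-1]. */
int saturate(int val, int n)
{
    if (val < -n)
        return -n;
    if (val > n - 1)
        return n - 1;
    return val;
}

static short iclip[1024];   /* clipping table, as in idct.c */
static short *iclp;

/* Table setup following init_idct from the text above. */
void init_idct(void)
{
    int i;
    iclp = iclip + 512;
    for (i = -512; i < 512; i++)
        iclp[i] = (short)((i < -256) ? -256 : ((i > 255) ? 255 : i));
}

/* Returns 1 if every table entry equals saturate(i, 256). */
int saturate_matches_table(void)
{
    int i;
    init_idct();
    for (i = -512; i < 512; i++)
        if (iclp[i] != saturate(i, 256))
            return 0;
    return 1;
}
```

Replacing the lookup by the function removes the memory access from the loop body, which is the reason the text prefers it for the XPP.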

The /* shortcut */ code in idctcol may speed column processing up if x1 to x7 are
zero. This breaks the well-formed structure of the loop nest. The if-condition is not loop
invariant and loop unswitching cannot be applied. Nonetheless, the code
after shortcut handling is well suited for the XPP. It is possible to synthesize if-conditions
for the XPP (speculative processing of both blocks plus selection based on condition) but this
would just waste PAEs without any performance benefit. Therefore, the /* shortcut */
code in idctrow and idctcol has to be removed manually. The code snippet below shows the
inlined version of the idctrow-loop with additional cache instructions for XPP control:

void idct(block)
short *block;
{
  int i;

  XPPPreload(IDCTROW_CONFIG);   // Loop invariant

  for (i=0; i<8; i++) {
    short *blk;
    int x0, x1, x2, x3, x4, x5, x6, x7, x8;
    blk = block+8*i;

    XPPPreload(0, blk, 8);
    XPPPreloadClean(1, blk, 8);   // IRAM1 is erased and assigned to blk
    XPPExecute(IDCTROW_CONFIG, IRAM(0), IRAM(1));
  }

  for (i=0; i<8; i++) {
    ...
  }
}

As the configuration of the XPP does not change during the loop execution, invariant code
motion has moved XPPPreload(IDCTROW_CONFIG) out of the loop.

NML Code Generation 
Data Flow Graph 

As idctcol is more complex due to clipping at the end of the calculations,
idctcol is well suited as a representative loop body for a presentation of the data flow graph.

Fig. 45 shows the data flow graph for the
IDCTCOLUMN_CONFIG. A heuristic has to be applied to the graph to estimate the
resource needs on the XPP. In this example, the heuristic produces the following results:

               ADD, SUB    MUL    <<X, >>X    Saturate(x,n)
Ops needed     35          11     18          8

               ALUs    FREGs    BREGs
Res. left      19      80       45
Res. avail.    64      80       80





The data flow graph fits into an XPP64 and this example may proceed
without loop dissevering (splitting the loop body into suitable chunks).
See Joao M.P. Cardoso et al., supra.

Address Generation

To fully synthesize the loop body, the problem of address generation for
accessing the data must be addressed.

For IDCTCOLUMN_CONFIG, the nth element of every row must be
selected, which means an address series of (0,8,16...1,9,17...7,15,23...).
Two counter macros may be used for address generation as shown in Fig. 46.
The upper counter increments by eight and the lower counter increments by one. The IRAM
output is passed to the data flow graph of IDCTCOLUMN. If all (eight) row elements of a
column are available, SWAP is switched through to the data flow graph input and the
calculation for a new column begins.

For the IDCTROW_CONFIG, the address generation is very simple as the IRAM already
has the block in the appropriate order (row after row as it has to be accessed). Again,
by using SIUP (stepped iterative up)-counter macros as described in the XPP tutorial, it is
possible to map linear address expressions to NML-code in a generic way. As
IDCTROW_CONFIG accesses a two-dimensional array, two SIUP-counters may be
needed in the corresponding NML code. The column-elements have to be accessed row after
row so the upper counter's increment is one and the lower counter's
increment is eight. However, the NML code for this access pattern (0...5,6,7,8,9...63) can
be reduced to one single counter (or to FIFO-mode IRAM access).
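
The column-wise address series can be reproduced with two nested counters in plain C. This is only an illustrative sketch of the address arithmetic, with hypothetical function names; the actual counters are NML macros. The inner counter steps by eight (next row of the same column) and the outer counter steps by one (next column).

```c
#define ROWS 8
#define COLS 8

/* Column-wise address series 0,8,16,...,56,1,9,...,63 built from two
 * counters: the inner one steps by 8, the outer one steps by 1. */
void column_addresses(int out[ROWS * COLS])
{
    int col, row, n = 0;
    for (col = 0; col < COLS; col++)
        for (row = 0; row < ROWS; row++)
            out[n++] = col + COLS * row;
}

/* Convenience accessor: the n-th address of the series. */
int column_address_at(int n)
{
    int addr[ROWS * COLS];
    column_addresses(addr);
    return addr[n];
}
```

Swapping the two increments yields the plain series 0,1,2,...,63, which is why the row-wise case can be reduced to a single counter.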

XPP-VC: A C Compiler with Temporal Partitioning for the PACT-XPP Architecture, J. M. P. Cardoso
and Markus Weinhardt.

Address generation for write access may be implemented in the same manner. The
resources have to be updated to take this additional code into account. It takes 2*(8+8+2*1)
FREGs and 2*(2+1) more BREGs in the worst case, which is still available on the XPP.

If IRAM use is not critical, it is also possible to distribute the data on several IRAMs to
improve the memory throughput into the XPP-array. This task may be done by the
RISC-core or by a more sophisticated XPP-cache controller.

Further Enhancing XPP Utilization 

As mentioned above, idct is called for all data blocks of a video image (loop
in transform.c). This circumstance may allow for improvement
of the XPP utilization.

When looking at the data flow graph of idctcol in detail, it can be seen that it
forms a very deep pipeline. Considering that the
IDCTROW_CONFIG runs only eight times on the XPP, which means that only 64 (8
times 8 elements of a column) elements are processed through this pipeline, and that the
change of the XPP configuration to the IDCTCOLUMN_CONFIG configuration to go on with
column processing must wait until all data has left the pipeline, this example is
suboptimal.

Problem (Pipeline Depth)

The pipeline is just too deep for processing only eight times eight rows. Filling and flushing
a deep pipeline is expensive if only little data is processed with it. First the units at the end of
the pipeline are idle and then the units at the beginning are unused, as shown in Fig. 47.

Solution (Loop Tiling) 

It is profitable to use loop interchange for moving the dependencies between row- and
column processing to an outer level of the loop nest. The loop that calls the idct-function (in
transform.c) on several blocks of the image has no loop interchange preventing
dependencies. Therefore, this loop can be moved inside the loops of column and row
processing, as shown in Fig. 48.

Now the processing of rows and columns can be applied on more data (by applying loop
tiling). Therefore, filling and flushing the pipeline can be neglected.
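
Why more data per configuration makes filling and flushing negligible can be seen with a simple model. The model is an assumption of this sketch and does not come from the XPP documentation: an idealized pipeline that accepts one item per cycle, so that n items need n + depth - 1 cycles. The function names are hypothetical, and a depth of 12 is used in the usage note purely as an illustrative value.

```c
/* Idealized model (an assumption of this sketch): a pipeline of the given
 * depth accepts one item per cycle, so n items need n + depth - 1 cycles. */
long pipeline_cycles(long items, long depth)
{
    return items + depth - 1;
}

/* Share of cycles (in percent, rounded down) in which a result leaves
 * the pipeline. */
long pipeline_utilization_percent(long items, long depth)
{
    return (100 * items) / pipeline_cycles(items, depth);
}
```

Under this model, 64 items through a pipeline of depth 12 give 85 percent utilization, while 6 blocks of 64 items each give 97 percent, since the fill and flush cost is paid only once per tile.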

Constraints (Cache Sensitive Loop Tiling) 

The caching hierarchy has to be taken into account when defining the number of
blocks that will be processed by the IDCTROW_CONFIG. As
discussed above, the same blocks are needed in the subsequent IDCTCOLUMN_CONFIG
configuration. It should be ensured that all blocks that are processed
during IDCTROW_CONFIG fit into the cache. Loop tiling has to be applied with respect to
the cache size so that the processed data fits into the cache.
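
The tile size constraint can be written down as a short calculation. The sketch below is illustrative only: the function names are hypothetical, the cache size is expressed in words, and a block is taken as 64 words (one 8x8 block) as in the text.

```c
#define WORDS_PER_BLOCK 64   /* one 8x8 block */

/* Largest number of blocks whose data fits into a cache of the given
 * size (in words). */
int blocks_per_tile(int cache_words)
{
    return cache_words / WORDS_PER_BLOCK;
}

/* Number of tiles needed to cover total_blocks, rounding up.
 * Returns -1 if the cache cannot hold a single block. */
int tile_count(int total_blocks, int cache_words)
{
    int per_tile = blocks_per_tile(cache_words);
    if (per_tile <= 0)
        return -1;
    return (total_blocks + per_tile - 1) / per_tile;
}
```

For a cache of 384 words, six blocks fit into one tile, which matches the block_count of 6 used as an example further below.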

IRAM reuse between different configurations 

This example implies another bandwidth optimization that is just another
version of loop tiling. Instead of transferring data from row processing to column processing
via the memory hierarchy (cache sensitive loop tiling takes care that only the cache memory
is accessed), the memory interface can be completely bypassed by
using the output IRAM of Config A as input IRAM of Config B, as shown in Fig. 49.

Putting all together 

If cache sensitive loop tiling, IRAM reuse, and function inlining are applied, the example
can be further optimized.

Finally, the idct-function becomes completely inlined in transform.c. If block_count is,
e.g., 6 and it is assumed that 64*6 words do not exceed the cache size, then
the example may be transformed to:

// transform.c
block = blocks[k*6];
XPPPreload(IDCTROW_CONFIG);
XPPPreload(0, block, 64*6);        // IRAM0 gets 64 words from 6 blocks
XPPPreloadClean(1, block, 64*6);   // erase IRAM1 and assign to the 6 blocks
XPPExecute(IDCTROW_CONFIG, IRAM(0), IRAM(1));

XPPPreload(IDCOLUMN_CONFIG);
XPPPreload(1, block, 64*6);        // redundant -> will be eliminated
XPPExecute(IDCOLUMN_CONFIG, IRAM(1), IRAM(2));

10 

The address generation in IDCTROW_CONFIG and IDCOLUMN_CONFIG has to be
modified to reflect the different data block size, caused by loop tiling, that has to be
processed. This can be implemented by an additional SUIP counter that generates the block
offsets inside the tiles, as shown in Fig. 50.

The following table provides architectural parameters for IDCTROW_CONFIG and
IDCOLUMN_CONFIG of the final result. It relies on a cache that is able to store
block_count blocks. As two configurations are executed in this example, the configuration cycles
have to be taken twice. Therefore, the total configuration cycles are
2 x (block_count x 64 + (12 + 2 x 8) x 2).



Parameter                 Value
Vector length             8 words
Reused data set size      block_count x 64 words
I/O IRAMs                 3 (one shared)
ALU                       45 FUs
BREG                      41 FUs
FREG                      36 FUs
Data flow graph width     8
Data flow graph height    12
Configuration cycles      block_count x 64 + (12 + 2 x 8) x 2
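The configuration-cycle formula can be evaluated for a concrete tile size. The following is a minimal sketch; the function names are illustrative assumptions, while the constants (64 words per block, graph height 12, graph width 8) are taken from the table and the text.

```c
/* Configuration cycles of one IDCT configuration, following the table:
 * block_count x 64 + (12 + 2 x 8) x 2. */
unsigned config_cycles(unsigned block_count)
{
    return block_count * 64u + (12u + 2u * 8u) * 2u;
}

/* Row and column configurations are both executed, so the count doubles. */
unsigned total_config_cycles(unsigned block_count)
{
    return 2u * config_cycles(block_count);
}
```

For block_count = 6, one configuration takes 384 + 56 = 440 cycles, and both together take 880 cycles.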



Performance Considerations 

In this example, it is possible to exploit high data locality, which means that many operations
are performed on a limited memory range. The performance of the proposed XPP solution of
this embodiment is compared to a hypothetical superscalar RISC-architecture. An issue width
of two is assumed, which means that the RISC executes on average two operations in parallel.



Ops for Row/Column   Count   Cycles/op   Est. RISC cycles
LD/ST                18      2           32
ADRCOMP              16      1           16
ADD/SUB              35      1           35
MULT                 11      2           22
SHIFT                18      1           18
SAT                  8       4           32

Issue Width          2
Proc. Rows           8
Proc. Cols           8

[Table residue: the RISC Cyc/Blk, XPP Cyc/Blk, Cyc/Row (Col), and Speedup entries (the latter given with data duplication + reordering) are not legibly recoverable; the legible fragments are 820, 620, 24, and 52.]



Even though the speedup is reasonable, fetching the input data from a single
IRAM (which means that it is required to feed the eight inputs in eight cycles before
processing is started) reduces the potential speedup significantly. In other words,
there is a pipeline that is able to process eight input values per cycle, but
the pipeline is loaded only every eighth cycle. As a result, only every eighth pipeline
stage is filled. Fig. 51 illustrates this.

Full utilization can be achieved only by loading the eight input values of the pipeline in one
cycle. A solution to improve the memory throughput to the pipeline is data
duplication as described under the heading "Hardware."

Instead of loading the six 8x8 blocks to a single IRAM, in an embodiment of the
present invention, the XPPPreloadMultiple command may be used to load the eight IRAMs
with the same contents:

XPPPreload(0, block, 64*6);  // IRAM0 gets 64 words from 6 blocks

is changed to:

XPPPreloadMultiple(0xFF, block, 64*6);  // load IRAM0 up to IRAM7 with blocks
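The first argument of XPPPreloadMultiple reads like a bit mask over the IRAMs. The sketch below decodes such a mask under the assumption that bit i selects IRAM i, which matches the comment that 0xFF loads IRAM0 up to IRAM7; the helper selected_irams is hypothetical and not part of any XPP interface.

```c
/* Assumption: bit i of mask selects IRAM i (so 0xFF selects IRAM0..IRAM7).
 * Writes the selected IRAM indices to out[] and returns how many were set.
 * out[] must have room for max_irams entries. */
int selected_irams(unsigned mask, int out[], int max_irams)
{
    int n = 0;
    for (int i = 0; i < max_irams; i++) {
        if (mask & (1u << i))
            out[n++] = i;
    }
    return n;
}
```

Under this reading, a mask of 0x05 would duplicate the data into IRAM0 and IRAM2 only.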
Now the pipeline is fully utilized and eight results per cycle must be
stored. This can be achieved by writing every output value to another IRAM, which
additionally takes eight more IRAMs. (Using data duplication in this example
requires all 16 IRAMs of the XPP64.) For storing the data that is generated with
IDCTROW_CONFIG, the following is changed:

XPPPreloadClean(1, block, 64*6);  // erase IRAM1 and assign to the 6 blocks

to:

tmpsize = 64*6/8;
XPPPreloadClean(8,  block+0*tmpsize, tmpsize);  // IRAM8 for interm. Rslt 1
XPPPreloadClean(9,  block+1*tmpsize, tmpsize);  // IRAM9 for interm. Rslt 1
XPPPreloadClean(10, block+2*tmpsize, tmpsize);  // IRAM10 for interm. Rslt 1
XPPPreloadClean(11, block+3*tmpsize, tmpsize);  // IRAM11 for interm. Rslt 1
XPPPreloadClean(12, block+4*tmpsize, tmpsize);  // IRAM12 for interm. Rslt 1
XPPPreloadClean(13, block+5*tmpsize, tmpsize);  // IRAM13 for interm. Rslt 1
XPPPreloadClean(14, block+6*tmpsize, tmpsize);  // IRAM14 for interm. Rslt 1
XPPPreloadClean(15, block+7*tmpsize, tmpsize);  // IRAM15 for interm. Rslt 1

This causes different data layouts for the intermediate results. An additional
configuration (REORDER_CONFIG), as shown in Fig. 52, may be needed to restore the
original data layout.
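The way the tile is split across the eight output IRAMs can be sketched as follows. The helper split_tile and its interface are assumptions made for illustration; it only mirrors the block + i*tmpsize arithmetic of the listing.

```c
#define NUM_OUT_IRAMS 8

/* Fills offsets[i] with the word offset of the slice assigned to output
 * IRAM (8 + i), mirroring block + i*tmpsize from the listing.
 * Returns the slice size (tmpsize). */
unsigned split_tile(unsigned tile_words, unsigned offsets[NUM_OUT_IRAMS])
{
    unsigned tmpsize = tile_words / NUM_OUT_IRAMS;
    for (unsigned i = 0; i < NUM_OUT_IRAMS; i++)
        offsets[i] = i * tmpsize;
    return tmpsize;
}
```

For a tile of 64*6 words, each output IRAM receives a slice of 48 words, and the last slice starts at offset 336.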
Again, address generation has to be modified to fetch eight input values per cycle. This, on
the one hand, requires seven additional adders, but, on the other hand, avoids swaps and
latches for keeping the data eight cycles.

Data duplication and data reordering may finally transform the example code to:
// transform.c

block = blocks[k*6];
XPPPreload(IDCTROW_CONFIG);
XPPPreloadMultiple(0xFF, block, 64*6);          // load IRAM0 up to IRAM7 with blocks
tmpsize = 64*6/8;                               // result gets separated into 8 IRAMs
XPPPreloadClean(8,  block+0*tmpsize, tmpsize);  // IRAM8 for interm. Rslt 1
XPPPreloadClean(9,  block+1*tmpsize, tmpsize);  // IRAM9 for interm. Rslt 1
XPPPreloadClean(10, block+2*tmpsize, tmpsize);  // IRAM10 for interm. Rslt 1
XPPPreloadClean(11, block+3*tmpsize, tmpsize);  // IRAM11 for interm. Rslt 1
XPPPreloadClean(12, block+4*tmpsize, tmpsize);  // IRAM12 for interm. Rslt 1
XPPPreloadClean(13, block+5*tmpsize, tmpsize);  // IRAM13 for interm. Rslt 1
XPPPreloadClean(14, block+6*tmpsize, tmpsize);  // IRAM14 for interm. Rslt 1
XPPPreloadClean(15, block+7*tmpsize, tmpsize);  // IRAM15 for interm. Rslt 1
XPPExecute(IDCTROW_CONFIG, IRAM(0-7), IRAM(8-15));

XPPPreload(IDCOLUMN_CONFIG);
XPPPreloadMultiple(0xFF, block, 64*6);          // ld IRAM0-IRAM7 with interm. Rslt 1
XPPPreloadClean(8,  block+0*tmpsize, tmpsize);  // IRAM8 for interm. Rslt 2
XPPPreloadClean(9,  block+1*tmpsize, tmpsize);  // IRAM9 for interm. Rslt 2
XPPPreloadClean(10, block+2*tmpsize, tmpsize);  // IRAM10 for interm. Rslt 2
XPPPreloadClean(11, block+3*tmpsize, tmpsize);  // IRAM11 for interm. Rslt 2
XPPPreloadClean(12, block+4*tmpsize, tmpsize);  // IRAM12 for interm. Rslt 2
XPPPreloadClean(13, block+5*tmpsize, tmpsize);  // IRAM13 for interm. Rslt 2
XPPPreloadClean(14, block+6*tmpsize, tmpsize);  // IRAM14 for interm. Rslt 2
XPPPreloadClean(15, block+7*tmpsize, tmpsize);  // IRAM15 for interm. Rslt 2
XPPExecute(IDCOLUMN_CONFIG, IRAM(0-7), IRAM(8-15));

XPPPreload(REORDER_CONFIG);
XPPPreloadMultiple(0xFF, block, 64*6);            // ld IRAM0-IRAM7 with interm. Rslt 2
rsltsize = 64;                                    // 64*6/6
XPPPreloadClean(8,  block+0*rsltsize, rsltsize);  // IRAM8 for final Rslt
XPPPreloadClean(9,  block+1*rsltsize, rsltsize);  // IRAM9 for final Rslt
XPPPreloadClean(10, block+2*rsltsize, rsltsize);  // IRAM10 for final Rslt
XPPPreloadClean(11, block+3*rsltsize, rsltsize);  // IRAM11 for final Rslt
XPPPreloadClean(12, block+4*rsltsize, rsltsize);  // IRAM12 for final Rslt
XPPPreloadClean(13, block+5*rsltsize, rsltsize);  // IRAM13 for final Rslt
XPPExecute(REORDER_CONFIG, IRAM(0-7), IRAM(8-13));

Wavelet

Original Code
void forward_wavelet()
{
    int i, nt, *dmid;
    int *sp, *dp, d_tmp0, d_tmp1, d_tmpi, s_tmp0, s_tmp1;
    int mid, ii;
    int *x;
    int s[256], d[256];

    for (nt = COL; nt >= BLOCK_SIZE; nt >>= 1) {
        for (i = 0; i < nt*COL /* tmp_nt */; i += COL) {
            x = &int_data[i];
            mid = (nt >> 1) - 1;
            s[0] = x[0];
            d[0] = x[ROW];
            s[1] = x[2];
            s[mid] = x[2*mid];
            d[mid] = x[2*mid+ROW];

            d[0] = (d[0] << 1) - s[0] - s[1];
            s[0] = s[0] + (d[0] >> 2);
            d_tmp0 = d[0];
            s_tmp0 = s[1];

            for (ii = 1; ii < mid; ii++) {
                s_tmp1 = x[2*ii+2];
                d_tmp1 = ((x[2*ii+ROW]) << 1) - s_tmp0 - s_tmp1;
                d[ii] = d_tmp1;
                s[ii] = s_tmp0 + ((d_tmp0 + d_tmp1) >> 3);
                d_tmp0 = d_tmp1;
                s_tmp0 = s_tmp1;
            }

            d[mid] = (d[mid] - s[mid]) << 1;
            s[mid] = s[mid] + ((d[mid-1] + d[mid]) >> 3);

            for (ii = 0; ii <= mid; ii++) {
                x[ii] = s[ii];
                x[ii+mid+1] = d[ii];
            }
        }

        for (i = 0; i < nt; i++) {
            x = &int_data[i];
            mid = (nt >> 1) - 1;
            s[0] = x[0];
            d[0] = x[COL];
            s[1] = x[COL<<1];
            s[mid] = x[(COL<<1)*mid];
            d[mid] = x[(COL<<1)*mid + COL];

            d[0] = (d[0] << 1) - s[0] - s[1];
            s[0] = s[0] + (d[0] >> 2);
            d_tmp0 = d[0];
            s_tmp0 = s[1];

            for (ii = 1; ii < mid; ii++) {
                s_tmp1 = x[2*COL*(ii+1)];
                d_tmp1 = (x[2*COL*ii+COL] << 1) - s_tmp0 - s_tmp1;
                d[ii] = d_tmp1;
                s[ii] = s_tmp0 + ((d_tmp0 + d_tmp1) >> 3);
                d_tmp0 = d_tmp1;
                s_tmp0 = s_tmp1;
            }

            d[mid] = (d[mid] << 1) - (s[mid] << 1);
            s[mid] = s[mid] + ((d[mid-1] + d[mid]) >> 3);

            for (ii = 0; ii <= mid; ii++) {
                x[ii*COL] = s[ii];
                x[(ii+mid+1)*COL] = d[ii];
            }
        }
    }
}

Optimizing the Whole Loop Nest

After pre-processing and application of copy propagation over s_tmp1, d_tmp1, the
following loop nest may be obtained:

void forward_wavelet()
{
    int i, nt, *dmid;
    int *sp, *dp, d_tmp0, d_tmp1, d_tmpi, s_tmp0, s_tmp1;
    int mid, ii;
    int *x;
    int s[256], d[256];

    for (nt = 64; nt >= 16; nt >>= 1) {
        for (i = 0; i < nt*64; i += 64) {
            x = &int_data[i];
            mid = (nt >> 1) - 1;
            s[0] = x[0];
            d[0] = x[1];
            s[1] = x[2];
            s[mid] = x[2*mid];
            d[mid] = x[2*mid+1];

            d[0] = (d[0] << 1) - s[0] - s[1];
            s[0] = s[0] + (d[0] >> 2);
            d_tmp0 = d[0];
            s_tmp0 = s[1];

            for (ii = 1; ii < mid; ii++) {
                d[ii] = ((x[2*ii+1]) << 1) - s_tmp0 - x[2*ii+2];
                s[ii] = s_tmp0 + ((d_tmp0 + d[ii]) >> 3);
                d_tmp0 = d[ii];
                s_tmp0 = s[ii];
            }

            d[mid] = (d[mid] - s[mid]) << 1;
            s[mid] = s[mid] + ((d[mid-1] + d[mid]) >> 3);

            for (ii = 0; ii <= mid; ii++) {
                x[ii] = s[ii];
                x[ii+mid+1] = d[ii];
            }
        }

        for (i = 0; i < nt; i++) {
            x = &int_data[i];
            mid = (nt >> 1) - 1;
            s[0] = x[0];
            d[0] = x[64];
            s[1] = x[128];
            s[mid] = x[128*mid];
            d[mid] = x[128*mid+64];

            d[0] = (d[0] << 1) - s[0] - s[1];
            s[0] = s[0] + (d[0] >> 2);
            d_tmp0 = d[0];
            s_tmp0 = s[1];

            for (ii = 1; ii < mid; ii++) {
                d[ii] = (x[128*ii+64] << 1) - s_tmp0 - x[128*(ii+1)];
                s[ii] = s_tmp0 + ((d_tmp0 + d[ii]) >> 3);
                d_tmp0 = d[ii];
                s_tmp0 = s[ii];
            }

            d[mid] = (d[mid] << 1) - (s[mid] << 1);
            s[mid] = s[mid] + ((d[mid-1] + d[mid]) >> 3);

            for (ii = 0; ii <= mid; ii++) {
                x[ii*64] = s[ii];
                x[(ii+mid+1)*64] = d[ii];
            }
        }
    }
}



Below is a table for each innermost loop. The tables for the first
and the third loops are identical, as are the tables for the second and the fourth loop.
Accordingly, 2 tables are presented below.



Parameter                 Value
Vector length             mid-2
Reused data set size
I/O IRAMs                 6
ALU                       6
BREG                      0
FREG                      2
Data flow graph width     2
Data flow graph height    6
Configuration cycles      6+(mid-2)

Parameter                 Value
Vector length             mid
Reused data set size
I/O IRAMs                 6
ALU                       0
BREG                      0
FREG                      0
Data flow graph width     2
Data flow graph height    1
Configuration cycles      mid



The two inner loops do not have the same iteration range and could be candidates
for loop fusion. Therefore, the first and last iterations of the second loop may be
peeled off. The surrounding code between the 2 loops can be moved to after the second
loop. Accordingly, the following code for the loop nest may be obtained.

for (nt = 64; nt >= 16; nt >>= 1) {
    for (i = 0; i < nt*64; i += 64) {
        x = &int_data[i];
        mid = (nt >> 1) - 1;
        s[0] = x[0];
        d[0] = x[1];
        s[1] = x[2];
        s[mid] = x[2*mid];
        d[mid] = x[2*mid+1];

        d[0] = (d[0] << 1) - s[0] - s[1];
        s[0] = s[0] + (d[0] >> 2);
        d_tmp0 = d[0];
        s_tmp0 = s[1];

        for (ii = 1; ii < mid; ii++) {
            d[ii] = ((x[2*ii+1]) << 1) - s_tmp0 - x[2*ii+2];
            s[ii] = s_tmp0 + ((d_tmp0 + d[ii]) >> 3);
            d_tmp0 = d[ii];
            s_tmp0 = s[ii];
        }

        for (ii = 1; ii < mid; ii++) {
            x[ii] = s[ii];
            x[ii+mid+1] = d[ii];
        }

        d[mid] = (d[mid] - s[mid]) << 1;
        s[mid] = s[mid] + ((d[mid-1] + d[mid]) >> 3);

        x[0] = s[0];
        x[mid+1] = d[0];
        x[mid] = s[mid];
        x[2*mid+1] = d[mid];
    }

    for (i = 0; i < nt; i++) {
        x = &int_data[i];
        mid = (nt >> 1) - 1;
        s[0] = x[0];
        d[0] = x[64];
        s[1] = x[128];
        s[mid] = x[128*mid];
        d[mid] = x[128*mid+64];

        d[0] = (d[0] << 1) - s[0] - s[1];
        s[0] = s[0] + (d[0] >> 2);
        d_tmp0 = d[0];
        s_tmp0 = s[1];

        for (ii = 1; ii < mid; ii++) {
            d[ii] = (x[128*ii+64] << 1) - s_tmp0 - x[128*(ii+1)];
            s[ii] = s_tmp0 + ((d_tmp0 + d[ii]) >> 3);
            d_tmp0 = d[ii];
            s_tmp0 = s[ii];
        }

        for (ii = 1; ii < mid; ii++) {
            x[ii*64] = s[ii];
            x[(ii+mid+1)*64] = d[ii];
        }

        d[mid] = (d[mid] << 1) - (s[mid] << 1);
        s[mid] = s[mid] + ((d[mid-1] + d[mid]) >> 3);

        x[0] = s[0];
        x[(mid+1)*64] = d[0];
        x[mid*64] = s[mid];
        x[(2*mid+1)*64] = d[mid];
    }
}
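The peeling step applied to the write-back loop can be checked in isolation. The sketch below is an illustration only: it takes the row write-back loop, runs the original and the peeled form on arbitrary sample values of s and d, and compares the results. The function names and the test data are assumptions.

```c
#include <string.h>

/* Original write-back loop: ii runs over 0..mid inclusive. */
void writeback_original(int *x, const int *s, const int *d, int mid)
{
    for (int ii = 0; ii <= mid; ii++) {
        x[ii] = s[ii];
        x[ii + mid + 1] = d[ii];
    }
}

/* Peeled form: the first and last iterations are moved after the loop,
 * so the loop range becomes 1..mid-1 like the computation loop. */
void writeback_peeled(int *x, const int *s, const int *d, int mid)
{
    for (int ii = 1; ii < mid; ii++) {
        x[ii] = s[ii];
        x[ii + mid + 1] = d[ii];
    }
    x[0] = s[0];
    x[mid + 1] = d[0];
    x[mid] = s[mid];
    x[2 * mid + 1] = d[mid];
}

/* Hypothetical check: returns 1 if both forms produce the same x for mid = 7. */
int peeling_preserves_result(void)
{
    enum { MID = 7, N = 2 * MID + 2 };
    int s[MID + 1], d[MID + 1], xa[N] = {0}, xb[N] = {0};

    for (int k = 0; k <= MID; k++) {
        s[k] = 10 + k;   /* arbitrary sample values */
        d[k] = 100 + k;
    }
    writeback_original(xa, s, d, MID);
    writeback_peeled(xb, s, d, MID);
    return memcmp(xa, xb, sizeof xa) == 0;
}
```

The check covers only the peeling of the write-back loop, which reads s and d and writes x; it says nothing about the later fusion step.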

After loop peeling, the only change with respect to the parameters is the vector length.
Accordingly, the tables are changed to the following:



Parameter                 Value
Vector length             mid-2
Reused data set size
I/O IRAMs                 6
ALU                       2
BREG                      0
FREG                      2
Data flow graph width     2
Data flow graph height    6
Configuration cycles      6+(mid-2)

Parameter                 Value
Vector length             mid-2
Reused data set size
I/O IRAMs                 6
ALU                       0
BREG                      0
FREG                      0
Data flow graph width     2
Data flow graph height    1
Configuration cycles      mid-2



The fusion of the inner loops is legal as there would be no loop-carried
dependencies between the instructions formerly in the second loop and the
instructions formerly in the first loop. The following loop nest may be
obtained.

for (nt = 64; nt >= 16; nt >>= 1) {
    for (i = 0; i < nt*64 /* tmp_nt */; i += 64) {
        x = &int_data[i];
        mid = (nt >> 1) - 1;
        s[0] = x[0];
        d[0] = x[1];
        s[1] = x[2];
        s[mid] = x[2*mid];
        d[mid] = x[2*mid+1];

        d[0] = (d[0] << 1) - s[0] - s[1];
        s[0] = s[0] + (d[0] >> 2);
        d_tmp0 = d[0];
        s_tmp0 = s[1];

        for (ii = 1; ii < mid; ii++) {
            d[ii] = ((x[2*ii+1]) << 1) - s_tmp0 - x[2*ii+2];
            s[ii] = s_tmp0 + ((d_tmp0 + d[ii]) >> 3);
            d_tmp0 = d[ii];
            s_tmp0 = s[ii];
            x[ii] = s[ii];
            x[ii+mid+1] = d[ii];
        }

        d[mid] = (d[mid] - s[mid]) << 1;
        s[mid] = s[mid] + ((d[mid-1] + d[mid]) >> 3);

        x[0] = s[0];
        x[mid+1] = d[0];
        x[mid] = s[mid];
        x[2*mid+1] = d[mid];
    }

    for (i = 0; i < nt; i++) {
        x = &int_data[i];
        mid = (nt >> 1) - 1;
        s[0] = x[0];
        d[0] = x[64];
        s[1] = x[128];
        s[mid] = x[128*mid];
        d[mid] = x[128*mid+64];

        d[0] = (d[0] << 1) - s[0] - s[1];
        s[0] = s[0] + (d[0] >> 2);
        d_tmp0 = d[0];
        s_tmp0 = s[1];

        for (ii = 1; ii < mid; ii++) {
            d[ii] = (x[128*ii+64] << 1) - s_tmp0 - x[128*(ii+1)];
            s[ii] = s_tmp0 + ((d_tmp0 + d[ii]) >> 3);
            d_tmp0 = d[ii];
            s_tmp0 = s[ii];
            x[ii*64] = s[ii];
            x[(ii+mid+1)*64] = d[ii];
        }

        d[mid] = (d[mid] << 1) - (s[mid] << 1);
        s[mid] = s[mid] + ((d[mid-1] + d[mid]) >> 3);

        x[0] = s[0];
        x[(mid+1)*64] = d[0];
        x[mid*64] = s[mid];
        x[(2*mid+1)*64] = d[mid];
    }
}

After loop fusion, there are only two loops, and they have the same
parameter table.



Parameter                 Value
Vector length             mid-2
Reused data set size
I/O IRAMs                 8
ALU                       6
BREG                      0
FREG                      2
Data flow graph width     2
Data flow graph height    6
Configuration cycles      6+(mid-2)



When performing value range analysis, the compiler finds that nt takes the values
64, 32, and 16. The upper bound of the inner loops is mid, which depends on the value of
nt. The analysis then finds that mid can take the values 31, 15, and 7. Loops with
constant loop bounds can be handled more efficiently on the PACT XPP. This means that the
inner loops can be better optimized if mid is replaced by a constant value. This will happen
when the outer loop is unrolled. This way, a larger set of code will be
obtained, but with 3 instances of the loop nest, each being a candidate for a configuration.
This can be seen as a kind of temporal partitioning. Thus, the outer loop is completely
unrolled, giving six new loop nests.

for (i = 0; i < 4096; i += 64) {  /* nt = 64 */
    x = &int_data[i];
    mid = 31;
    s[0] = x[0];
    d[0] = x[1];
    s[1] = x[2];
    s[31] = x[62];
    d[31] = x[63];

    d[0] = (d[0] << 1) - s[0] - s[1];
    s[0] = s[0] + (d[0] >> 2);
    d_tmp0 = d[0];
    s_tmp0 = s[1];

    for (ii = 1; ii < 31; ii++) {
        d[ii] = ((x[2*ii+1]) << 1) - s_tmp0 - x[2*ii+2];
        s[ii] = s_tmp0 + ((d_tmp0 + d[ii]) >> 3);
        d_tmp0 = d[ii];
        s_tmp0 = s[ii];
        x[ii] = s[ii];
        x[ii+32] = d[ii];
    }

    d[31] = (d[31] - s[31]) << 1;
    s[31] = s[31] + ((d[30] + d[31]) >> 3);
    x[0] = s[0];
    x[32] = d[0];
    x[31] = s[31];
    x[63] = d[31];
}

for (i = 0; i < 64; i++) {
    x = &int_data[i];
    mid = 31;
    s[0] = x[0];
    d[0] = x[64];
    s[1] = x[128];
    s[31] = x[3968];
    d[31] = x[4032];

    d[0] = (d[0] << 1) - s[0] - s[1];
    s[0] = s[0] + (d[0] >> 2);
    d_tmp0 = d[0];
    s_tmp0 = s[1];

    for (ii = 1; ii < 31; ii++) {
        d[ii] = (x[128*ii+64] << 1) - s_tmp0 - x[128*(ii+1)];
        s[ii] = s_tmp0 + ((d_tmp0 + d[ii]) >> 3);
        d_tmp0 = d[ii];
        s_tmp0 = s[ii];
        x[ii*64] = s[ii];
        x[(ii+32)*64] = d[ii];
    }

    d[31] = (d[31] << 1) - (s[31] << 1);
    s[31] = s[31] + ((d[30] + d[31]) >> 3);
    x[0] = s[0];
    x[2048] = d[0];
    x[1984] = s[31];
    x[4032] = d[31];
}

for (i = 0; i < 2048; i += 64) {  /* nt = 32 */
    x = &int_data[i];
    mid = 15;
    s[0] = x[0];
    d[0] = x[1];
    s[1] = x[2];
    s[15] = x[30];
    d[15] = x[31];

    d[0] = (d[0] << 1) - s[0] - s[1];
    s[0] = s[0] + (d[0] >> 2);
    d_tmp0 = d[0];
    s_tmp0 = s[1];

    for (ii = 1; ii < 15; ii++) {
        d[ii] = ((x[2*ii+1]) << 1) - s_tmp0 - x[2*ii+2];
        s[ii] = s_tmp0 + ((d_tmp0 + d[ii]) >> 3);
        d_tmp0 = d[ii];
        s_tmp0 = s[ii];
        x[ii] = s[ii];
        x[ii+16] = d[ii];
    }

    d[15] = (d[15] - s[15]) << 1;
    s[15] = s[15] + ((d[14] + d[15]) >> 3);
    x[0] = s[0];
    x[16] = d[0];
    x[15] = s[15];
    x[31] = d[15];
}

for (i = 0; i < 32; i++) {
    x = &int_data[i];
    mid = 15;
    s[0] = x[0];
    d[0] = x[64];
    s[1] = x[128];
    s[15] = x[1920];
    d[15] = x[1984];

    d[0] = (d[0] << 1) - s[0] - s[1];
    s[0] = s[0] + (d[0] >> 2);
    d_tmp0 = d[0];
    s_tmp0 = s[1];

    for (ii = 1; ii < 15; ii++) {
        d[ii] = (x[128*ii+64] << 1) - s_tmp0 - x[128*(ii+1)];
        s[ii] = s_tmp0 + ((d_tmp0 + d[ii]) >> 3);
        d_tmp0 = d[ii];
        s_tmp0 = s[ii];
        x[ii*64] = s[ii];
        x[(ii+16)*64] = d[ii];
    }

    d[15] = (d[15] << 1) - (s[15] << 1);
    s[15] = s[15] + ((d[14] + d[15]) >> 3);
    x[0] = s[0];
    x[1024] = d[0];
    x[960] = s[15];
    x[1984] = d[15];
}

for (i = 0; i < 1024; i += 64) {  /* nt = 16 */
    x = &int_data[i];
    mid = 7;
    s[0] = x[0];
    d[0] = x[1];
    s[1] = x[2];
    s[7] = x[14];
    d[7] = x[15];

    d[0] = (d[0] << 1) - s[0] - s[1];
    s[0] = s[0] + (d[0] >> 2);
    d_tmp0 = d[0];
    s_tmp0 = s[1];

    for (ii = 1; ii < 7; ii++) {
        d[ii] = ((x[2*ii+1]) << 1) - s_tmp0 - x[2*ii+2];
        s[ii] = s_tmp0 + ((d_tmp0 + d[ii]) >> 3);
        d_tmp0 = d[ii];
        s_tmp0 = s[ii];
        x[ii] = s[ii];
        x[ii+8] = d[ii];
    }

    d[7] = (d[7] - s[7]) << 1;
    s[7] = s[7] + ((d[6] + d[7]) >> 3);
    x[0] = s[0];
    x[8] = d[0];
    x[7] = s[7];
    x[15] = d[7];
}

for (i = 0; i < 16; i++) {
    x = &int_data[i];
    mid = 7;
    s[0] = x[0];
    d[0] = x[64];
    s[1] = x[128];
    s[7] = x[896];
    d[7] = x[960];

    d[0] = (d[0] << 1) - s[0] - s[1];
    s[0] = s[0] + (d[0] >> 2);
    d_tmp0 = d[0];
    s_tmp0 = s[1];

    for (ii = 1; ii < 7; ii++) {
        d[ii] = (x[128*ii+64] << 1) - s_tmp0 - x[128*(ii+1)];
        s[ii] = s_tmp0 + ((d_tmp0 + d[ii]) >> 3);
        d_tmp0 = d[ii];
        s_tmp0 = s[ii];
        x[ii*64] = s[ii];
        x[(ii+8)*64] = d[ii];
    }

    d[7] = (d[7] << 1) - (s[7] << 1);
    s[7] = s[7] + ((d[6] + d[7]) >> 3);
    x[0] = s[0];
    x[512] = d[0];
    x[448] = s[7];
    x[960] = d[7];
}

In the parameter table, the vector length is the only value that changes.
Below is a parameter table for the first two loops. To deduce the table for the other loops, the
vector length has to be set to 14 and 6, respectively.



Parameter                 Value
Vector length             30
Reused data set size
I/O IRAMs                 8
ALU                       6
BREG                      0
FREG                      2
Data flow graph width     2
Data flow graph height    6
Configuration cycles      6+30=36
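The values found by the value range analysis can be reproduced by simply enumerating the outer loop. This is a minimal sketch, not a description of how the compiler performs the analysis; the function name and its interface are assumptions.

```c
/* Enumerates the values that nt and mid take in
 *     for (nt = 64; nt >= 16; nt >>= 1) { mid = (nt >> 1) - 1; ... }
 * together with the trip count of the inner loop
 *     for (ii = 1; ii < mid; ii++).
 * Writes at most cap entries and returns the number written. */
int loop_value_ranges(int nt_vals[], int mid_vals[], int trips[], int cap)
{
    int n = 0;
    for (int nt = 64; nt >= 16 && n < cap; nt >>= 1) {
        int mid = (nt >> 1) - 1;
        int count = 0;
        for (int ii = 1; ii < mid; ii++)
            count++;
        nt_vals[n] = nt;
        mid_vals[n] = mid;
        trips[n] = count;
        n++;
    }
    return n;
}
```

The enumeration yields mid = 31, 15, and 7 with inner trip counts of 30, 14, and 6, which are the vector lengths quoted for the unrolled loop nests.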



Optimizing the Inner Loops

The efforts are then concentrated on the six inner loops. They all need 2 input data
and 4 output data. 2 more data are needed for the first iteration. Hence, at most,
8 IRAMs are required for the first iteration and 6 for the others. This means that the
loops can be unrolled twice, requiring 14 IRAMs for one iteration of the new loop bodies.
Below are presented only the unrolled inner loops, for convenience.

The first loop may be as follows:

for (ii = 1; ii < 31; ii = ii+2) {
    d[ii] = ((x[2*ii+1]) << 1) - s_tmp0 - x[2*ii+2];
    s[ii] = s_tmp0 + ((d_tmp0 + d[ii]) >> 3);
    d_tmp0 = d[ii];
    s_tmp0 = s[ii];
    x[ii] = s[ii];
    x[ii+32] = d[ii];
    d[ii+1] = ((x[2*(ii+1)+1]) << 1) - s_tmp0 - x[2*(ii+1)+2];
    s[ii+1] = s_tmp0 + ((d_tmp0 + d[ii+1]) >> 3);
    d_tmp0 = d[ii+1];
    s_tmp0 = s[ii+1];
    x[ii+1] = s[ii+1];
    x[ii+33] = d[ii+1];
}

The second loop may be as follows:

for (ii = 1; ii < 31; ii = ii+2) {
    d[ii] = (x[128*ii+64] << 1) - s_tmp0 - x[128*(ii+1)];
    s[ii] = s_tmp0 + ((d_tmp0 + d[ii]) >> 3);
    d_tmp0 = d[ii];
    s_tmp0 = s[ii];
    x[ii*64] = s[ii];
    x[(ii+32)*64] = d[ii];
    d[ii+1] = (x[128*(ii+1)+64] << 1) - s_tmp0 - x[128*(ii+2)];
    s[ii+1] = s_tmp0 + ((d_tmp0 + d[ii+1]) >> 3);
    d_tmp0 = d[ii+1];
    s_tmp0 = s[ii+1];
    x[(ii+1)*64] = s[ii+1];
    x[(ii+33)*64] = d[ii+1];
}

The third loop may be as follows:

for (ii = 1; ii < 15; ii = ii+2) {
    d[ii] = ((x[2*ii+1]) << 1) - s_tmp0 - x[2*ii+2];
    s[ii] = s_tmp0 + ((d_tmp0 + d[ii]) >> 3);
    d_tmp0 = d[ii];
    s_tmp0 = s[ii];
    x[ii] = s[ii];
    x[ii+16] = d[ii];
    d[ii+1] = ((x[2*(ii+1)+1]) << 1) - s_tmp0 - x[2*(ii+1)+2];
    s[ii+1] = s_tmp0 + ((d_tmp0 + d[ii+1]) >> 3);
    d_tmp0 = d[ii+1];
    s_tmp0 = s[ii+1];
    x[ii+1] = s[ii+1];
    x[ii+17] = d[ii+1];
}

The fourth loop may be as follows:

for (ii = 1; ii < 15; ii = ii+2) {
    d[ii] = (x[128*ii+64] << 1) - s_tmp0 - x[128*(ii+1)];
    s[ii] = s_tmp0 + ((d_tmp0 + d[ii]) >> 3);
    d_tmp0 = d[ii];
    s_tmp0 = s[ii];
    x[ii*64] = s[ii];
    x[(ii+16)*64] = d[ii];
    d[ii+1] = (x[128*(ii+1)+64] << 1) - s_tmp0 - x[128*(ii+2)];
    s[ii+1] = s_tmp0 + ((d_tmp0 + d[ii+1]) >> 3);
    d_tmp0 = d[ii+1];
    s_tmp0 = s[ii+1];
    x[(ii+1)*64] = s[ii+1];
    x[(ii+17)*64] = d[ii+1];
}

The fifth loop may be as follows:

for (ii = 1; ii < 7; ii = ii+2) {
    d[ii] = ((x[2*ii+1]) << 1) - s_tmp0 - x[2*ii+2];
    s[ii] = s_tmp0 + ((d_tmp0 + d[ii]) >> 3);
    d_tmp0 = d[ii];
    s_tmp0 = s[ii];
    x[ii] = s[ii];
    x[ii+8] = d[ii];
    d[ii+1] = ((x[2*(ii+1)+1]) << 1) - s_tmp0 - x[2*(ii+1)+2];
    s[ii+1] = s_tmp0 + ((d_tmp0 + d[ii+1]) >> 3);
    d_tmp0 = d[ii+1];
    s_tmp0 = s[ii+1];
    x[ii+1] = s[ii+1];
    x[ii+9] = d[ii+1];
}

The sixth loop may be as follows:

for (ii = 1; ii < 7; ii = ii+2) {
    d[ii] = (x[128*ii+64] << 1) - s_tmp0 - x[128*(ii+1)];
    s[ii] = s_tmp0 + ((d_tmp0 + d[ii]) >> 3);
    d_tmp0 = d[ii];
    s_tmp0 = s[ii];
    x[ii*64] = s[ii];
    x[(ii+8)*64] = d[ii];
    d[ii+1] = (x[128*(ii+1)+64] << 1) - s_tmp0 - x[128*(ii+2)];
    s[ii+1] = s_tmp0 + ((d_tmp0 + d[ii+1]) >> 3);
    d_tmp0 = d[ii+1];
    s_tmp0 = s[ii+1];
    x[(ii+1)*64] = s[ii+1];
    x[(ii+9)*64] = d[ii+1];
}

Fig. 53 is a dataflow graph of these loop bodies after a step of tree
balancing has been performed. The dataflow graph of Fig. 53 corresponds to the
first loop. To obtain the graphs for the other loops, only
the input and output data need to be changed.

Each input and output data will occupy an IRAM. d0 and s0 will be the only values in their
IRAM, enabling the merge operations to select between d0 and s0, respectively, at the first
iteration and the feedback values for the other iterations. Once the pipeline is filled, 8 values
can be output in a cycle, corresponding to 4 values for array x. The same configuration is
used for all loops; only the data in the IRAMs differ. Below are result tables
for only the 2 first loops. The tables for the other loops are the same.

For the first two loops, the following table is obtained, and the expected speedup
with respect to a standard superscalar processor with 2 instructions issued per cycle is 15.3.



Parameter                 Value
Vector length             30
Reused data set size
I/O IRAMs                 14
ALU                       12
BREG                      0
FREG                      2
Data flow graph width     2
Data flow graph height    10
Configuration cycles      10+15=25

Ops                                  Number
LD/ST (2 cycles)                     14
ADDRCOMP (1 cycle)                   2
ADD/SUB (1 cycle)                    17
MUL (2 cycles)                       0
SHIFT (1 cycle)                      4
Cycles per iteration                 51
Cycles needed for the loop (2-way)   (51*15)/2=383
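The arithmetic behind the RISC estimate and the quoted speedup can be retraced as follows. This is a sketch under the assumption that the estimate is simply the sum of operation counts weighted by their cycle costs, divided by the issue width and rounded up; the type and function names are illustrative.

```c
/* One operation class from the table: how many operations per iteration
 * and how many cycles each one costs. */
struct op_class {
    unsigned count;
    unsigned cycles;
};

/* Operation mix of the unrolled wavelet loop body, in table order:
 * LD/ST, ADDRCOMP, ADD/SUB, MUL, SHIFT. */
static const struct op_class WAVELET_OPS[] = {
    {14, 2}, {2, 1}, {17, 1}, {0, 2}, {4, 1}
};

/* Sum of count x cycles over all operation classes. */
unsigned cycles_per_iteration(const struct op_class *ops, unsigned n)
{
    unsigned total = 0;
    for (unsigned i = 0; i < n; i++)
        total += ops[i].count * ops[i].cycles;
    return total;
}

/* Cycles for the whole loop on a machine issuing issue_width operations
 * per cycle, rounded up to a whole cycle. */
unsigned loop_cycles(unsigned per_iter, unsigned iterations, unsigned issue_width)
{
    return (per_iter * iterations + issue_width - 1) / issue_width;
}
```

With the table values this gives 51 cycles per iteration and 383 cycles for 15 iterations on a 2-way machine; dividing by the 25 configuration cycles of the XPP version gives roughly 15.3, the speedup stated in the text.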



References

The limitations of conventional processors are becoming more and more evident. The
growing importance of stream-based applications makes coarse-grain dynamically
reconfigurable architectures an attractive alternative [3], [1], [6], [7]. They combine the
performance of ASICs, which are very risky and expensive (development and mask costs),
with the flexibility of traditional processors [5].

In spite of the possibilities we have today in VLSI development, the basic concepts of
microprocessor architectures are the same as 20 years ago. The main processing unit of
modern conventional microprocessors, the datapath, in its actual structure follows the same
style guidelines as its predecessors. Although the development of pipelined architectures or
superscalar concepts in combination with data and instruction caches increases the
performance of a modern microprocessor and allows higher frequency rates, the main concept
of a static datapath remains. Therefore, each operation is a composition of basic instructions
that the used processor owns. The benefit of the processor concept lies in the ability to
execute strongly control-dominant applications. Data or stream oriented applications are not
well suited for this environment. The sequential instruction execution isn't the right target
for that kind of application and needs high bandwidth because of permanent retransmitting
of instructions/data from and to memory. This handicap is often eased by using caches in
various stages. A sequential interconnection of filters, which do the according data
manipulation without writing back the intermediate results, would achieve the right optimization
and reduction of bandwidth. Practically, this kind of chain of filters should be constructed in
a logical way and configured during runtime. Existing approaches to extend instruction sets
use static modules, not modifiable during runtime.

Customized microprocessors or ASICs are optimized for one special application
environment. It is nearly impossible to use the same microprocessor core for another
application without losing the performance gain of this architecture.

A new approach of a flexible and high performance datapath concept is needed, which allows
reconfiguring the functionality and makes this core mainly application independent without
losing the performance needed for stream-based applications.

This contribution introduces a new concept of loosely coupled implementation of the
dynamic reconfigurable XPP architecture from PACT Corp. into a static datapath of the
SPARC compatible LEON processor. Thus, this approach is different from those where the
XPP operates as a completely separate (master) component within one Configurable System
on Chip (CSoC), together with a processor core, global/local memory topologies and efficient
multi-layer AMBA bus interfaces [11]. Here, from the programmer's point of view, the
extended and adapted datapath seems like a dynamic configurable instruction set. It can be
customized for a specific application and accelerate the execution enormously. Therefore,
the programmer has to create a number of configurations, which can be uploaded to the XPP
Array at run time, e.g., this configuration can be used like a filter to calculate
stream-oriented data. It is also possible to configure more than one function at the same time and
use them simultaneously. This concept promises an enormous performance boost and the
needed flexibility and power reduction to perform a series of applications very effectively.

6^3 1 — LEON RISC Microproce ss or 

10 For implemontation of this concept we chose the 32 bit SPARC V8 compatible 

microproc e ssor [1] [2], LEOl^J. This microprocessor is a synthesisablo, free available VHDL 
model which has a load/store architecture and has a five stag e s pipeline implementation with 
separat e d instruction and data cach e s. 

Figure 66: LEON Architecture Overview 

15 As shown in Figur e 66 tho LEON is provid e d with a fiiU implem e ntation of AMBA 2.0 AHB 
and i^PB on chip bus, a hardware multiplier and devider, programmable 8 /16/32 bit memor>^ 
controller for oxtemal PR.OM, static RAM and SDRi\M and several on chip peripherals such 
as timors, UAP.Ts, interrupt controller and a 16 bit I/O port. A simple power down mode is 
implemented as w e ll. 

LEON was developed by the European Space Agency (ESA) for future space missions. Its
performance is close to that of an ARM9-series processor, but it has no memory management
unit (MMU), which limits its use to single-memory-space applications. Figure 67 shows the
datapath of the LEON integer unit.

Figure 67: LEON Pipelined Datapath Structure

2. eXtreme Processing Platform (XPP)

The XPP architecture [6], [7], [8] is based on a hierarchical array of coarse-grain, adaptive
computing elements called Processing Array Elements (PAEs) and a packet-oriented
communication network. The strength of the XPP technology originates from the
combination of array processing with unique, powerful run-time reconfiguration mechanisms.
Since configuration control is distributed over a Configuration Manager (CM) embedded in
the array, PAEs can be configured rapidly in parallel while neighboring PAEs are processing
data. Entire applications can be configured and run independently on different parts of the
array. Reconfiguration is triggered externally or even by special event signals originating
within the array, enabling self-reconfiguring designs. By utilizing protocols implemented in
hardware, data and event packets are used to process, generate, decompose and merge
streams of data.

The XPP has some similarities with other coarse-grain reconfigurable architectures like the
KressArray [3] or Raw Machines [1], which are specifically designed for stream-based
applications. XPP's main distinguishing features are its automatic packet-handling
mechanisms and its sophisticated hierarchical configuration protocols for runtime and
self-reconfiguration.

2.1 Array Structure

A CM consists of a state machine and internal RAM for configuration caching. The PAC
itself (see top right-hand side of Figure 69) contains a configuration bus which connects the
CM with PAEs and other configurable objects. Horizontal busses carry data and events.
They can be segmented by configurable switch objects and connected to PAEs and special
I/O objects at the periphery of the device.

A PAE is a collection of PAE objects. The typical PAE shown in Figure 69 (bottom)
contains a BREG object (back registers) and an FREG object (forward registers), which are
used for vertical routing, as well as an ALU object, which performs the actual computations.
The ALU performs common fixed-point arithmetical and logical operations as well as several
special three-input opcodes like multiply-add, sort, and counters. Events generated by ALU
objects depend on ALU results or exceptions, very similar to the state flags of a classical
microprocessor. A counter, e.g., generates a special event only after it has terminated. The
next section explains how these events are used. Another PAE object implemented in the
XPP is a memory object, which can be used in FIFO mode or as RAM for lookup tables,
intermediate results, etc. However, any PAE object functionality can be included in the XPP
architecture.

2.2 Packet Handling and Synchronization

PAE objects as defined above communicate via a packet-oriented network. Two types of
packets are sent through the array: data packets and event packets. Data packets have a
uniform bit width specific to the device type. In normal operation mode, PAE objects are
self-synchronizing. An operation is performed as soon as all necessary data input packets
are available. The results are forwarded as soon as they are available, provided the previous
results have been consumed. Thus it is possible to map a signal-flow graph directly to ALU
objects. Event packets are one bit wide. They transmit state information which controls
ALU execution and packet generation.
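
The self-synchronizing behavior described above can be illustrated with a small software model. This is only a sketch under stated assumptions: the class name `AluObject`, its methods and the single-entry output register are hypothetical and not part of the XPP hardware description. An operation fires only when every input holds a packet and the previous result has been consumed.

```python
from collections import deque


class AluObject:
    """Toy model of a self-synchronizing ALU object (illustrative only)."""

    def __init__(self, op, n_inputs=2):
        self.op = op
        self.inputs = [deque() for _ in range(n_inputs)]  # pending data packets per port
        self.output = None  # assumed single output register; None means "consumed"

    def push(self, port, value):
        """Deliver one data packet to an input port."""
        self.inputs[port].append(value)

    def step(self):
        """Fire once if all inputs hold a packet and the last result was consumed."""
        if self.output is not None:
            return False  # previous result not yet consumed
        if any(not queue for queue in self.inputs):
            return False  # at least one operand is still missing
        self.output = self.op(*(queue.popleft() for queue in self.inputs))
        return True

    def take(self):
        """Consume the result, freeing the output for the next firing."""
        value, self.output = self.output, None
        return value


adder = AluObject(lambda a, b: a + b)
adder.push(0, 3)
fired_early = adder.step()  # second operand missing, so nothing happens
adder.push(1, 4)
fired = adder.step()        # both operands present, the operation fires
```

Chaining several such objects, with each `take` feeding a `push` of the next object, gives a direct software analogue of mapping a signal-flow graph onto ALU objects.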

2.3 Configuration

Figure 69: Structure of an XPP device

Every PAE locally stores its current configuration state, i.e. whether it is part of a
configuration or not (states "configured" or "free"). Once a PAE is configured, it changes its
state to "configured". This prevents the CM from reconfiguring a PAE which is still used by
another application. The CM caches the configuration data in its internal RAM until the
required PAEs become available.

While loading a configuration, all PAEs start to compute their part of the application as soon
as they are in state "configured". Partially configured applications are able to process data
without loss of packets. This concurrency of configuration and computation hides
configuration latency.
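
The state handling described above can be sketched as follows. The model is deliberately simplified and its names (`ConfigurationManager`, `request`, `release`) are hypothetical. It also loads a configuration only when all of its PAEs are free, whereas the text describes PAEs starting to compute individually as soon as they are configured.

```python
class ConfigurationManager:
    """Toy CM: caches configurations until the PAEs they need are free."""

    def __init__(self, n_paes):
        self.state = ["free"] * n_paes  # per-PAE state: "free" or "configured"
        self.cache = []                 # (name, set of PAE indices) waiting to be loaded

    def request(self, name, paes):
        """Queue a configuration and load it immediately if possible."""
        self.cache.append((name, set(paes)))
        return self.load_pending()

    def load_pending(self):
        """Load every cached configuration whose PAEs are all free."""
        loaded, waiting = [], []
        for name, paes in self.cache:
            if all(self.state[p] == "free" for p in paes):
                for p in paes:
                    self.state[p] = "configured"  # protects the PAE from reconfiguration
                loaded.append(name)
            else:
                waiting.append((name, paes))      # stays cached in the CM's RAM
        self.cache = waiting
        return loaded

    def release(self, paes):
        """Mark PAEs as free again and retry the cached configurations."""
        for p in paes:
            self.state[p] = "free"
        return self.load_pending()
```

A configuration that overlaps a running one simply stays in the cache until `release` frees the contested PAEs.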

2.4 XPP Application Mapping

The Native Mapping Language (NML), a PACT proprietary structural language with
reconfiguration primitives, was developed by PACT to map applications to the XPP array. It
gives the programmer direct access to all hardware features.

In NML, configurations consist of modules which are specified as in a structural hardware
description language, similar to, for instance, structural VHDL. PAE objects are explicitly
allocated, optionally placed, and their connections specified. Hierarchical modules allow
component reuse, especially for repetitive layouts. Additionally, NML includes statements
to support configuration handling. A complete NML application program consists of one or
more modules, a sequence of initially configured modules, differential changes, and
statements which map event signals to configuration and prefetch requests. Thus
configuration handling is an explicit part of the application program.

A complete XPP Development Suite (XDS) is available from PACT. For more details on
XPP-based architectures and development tools see [6].

3. LEON Instruction Datapath Extension

The system is designed to offer maximum performance. LEON and XPP should be able
to communicate with each other in a simple and high-performance manner. While the XPP is
a dataflow-oriented device, the LEON is a general-purpose processor, suitable for handling
control flow [1], [2]. Therefore, LEON is used for system control. To do this, the XPP is
integrated into the datapath of the LEON integer unit, which is able to control the XPP.

Figure 71: Extended Datapath Overview 

Due to the unpredictable operation time of the XPP algorithm, the XPP is integrated into the
LEON datapath in a loosely coupled way (Figure 71). Thus the XPP array can operate
independently of the LEON, which is able to control and reconfigure the XPP during
runtime. Since the configuration of the XPP is handled by LEON, the CM of the XPP is
unnecessary and can be left out of the XPP array. The configuration codes are stored in the
LEON RAM. LEON transfers the needed configuration from its system RAM into the XPP
and creates the needed algorithm on the array.

To make the XPP as independent of LEON as possible, all ports of the XPP, input ports
as well as output ports, are buffered using dual-clock FIFOs. The dual-clock FIFOs are
implemented in the IO Ports between LEON and XPP. To transmit data to the extended
XPP-based datapath, the data are passed through an IO Port as shown in Figure 5. In
addition to the FIFO, the IO Ports contain logic to generate handshake signals and an interrupt
request signal. The IO Port for receiving data from the XPP is similar to Figure 5 except for the
reversed direction of the data signals. This enables the XPP to work completely
independently of LEON as long as there are input data available in the input port FIFOs
and free space for result data in the output port FIFOs. A number of additional
features are implemented in the LEON pipeline to control the data transfer between LEON and
XPP:

When LEON tries to write to an IO Port containing a full FIFO or to read from an IO Port
containing an empty FIFO, a trap is generated. This trap can be handled through a trap
handler. A further mechanism, pipeline holding, is implemented to allow LEON to hold
the pipeline and wait for free FIFO space during an XPP write access, or to wait
for a valid FIFO value during an XPP read access. When using pipeline holding, the software
developer has to avoid reading from an IO Port with an empty FIFO while the XPP, or rather
the XPP input IO Ports, contain no data from which to produce outputs. In this case a deadlock
will occur and the complete system has to be reset.
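
The two behaviors described above, trap handling and pipeline holding, can be contrasted in a small model. This is a sketch only: the dual-clock aspect is not modeled, the FIFO depth is an assumed parameter (the text gives none), and all names (`IoPort`, `FifoTrap`, `STALL`) are hypothetical. A held pipeline is represented by returning a `STALL` marker so that the caller can retry.

```python
from collections import deque

STALL = object()  # marker: the access could not complete and the pipeline would be held


class FifoTrap(Exception):
    """Raised in trap mode on a write to a full FIFO or a read from an empty one."""


class IoPort:
    """Toy single-clock model of one FIFO-buffered IO port."""

    def __init__(self, depth, hold=False):
        self.depth = depth  # assumed FIFO capacity
        self.hold = hold    # True: pipeline holding, False: trap handling
        self.fifo = deque()

    def valid_values(self):
        """Number of valid values currently stored in the FIFO."""
        return len(self.fifo)

    def write(self, value):
        if len(self.fifo) >= self.depth:
            if self.hold:
                return STALL
            raise FifoTrap("write to full FIFO")
        self.fifo.append(value)
        return None

    def read(self):
        if not self.fifo:
            if self.hold:
                return STALL
            raise FifoTrap("read from empty FIFO")
        return self.fifo.popleft()
```

In hold mode, a reader that keeps retrying on `STALL` while nothing will ever be written corresponds to the deadlock case the text warns about.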

The XPP can generate interrupts for the LEON when it tries to read a value from an empty FIFO
port or to write a value to a full FIFO port. The occurrence of an interrupt indicates that the
XPP array cannot process the next step because it either has no input values or cannot
output the result value. The interrupts generated by the XPP are maskable.

The interface provides information about the FIFOs. LEON can read the number of valid
values the FIFO contains.

Figure 72: LEON to XPP dual clock FIFO 

The interface to the XPP appears to the LEON as a set of special registers (Figure 6). These
XPP registers can be categorized into communication registers and status registers.

For data exchange the XPP communication registers are used. Since the XPP provides three
different types of communication ports, there are also three types of communication
registers, each of which is split into an input part and an output part:

The data for the process are accessed through XPP data registers. The number of data input
and data output ports as well as the data bit width depend on the implemented XPP array.
The XPP can generate and consume events. Events are one-bit signals. The number of input
events and output events again depends on the implemented XPP array.

Configuration of the XPP is done through the XPP configuration register. LEON reads the
required configuration value from a file stored in its system RAM and writes it to the XPP
configuration register.

A number of XPP status registers are implemented to control the behavior of the interface and
to obtain status information about it. Switching between trap handling and pipeline
holding is done in the hold register. An XPP clock register contains the clock frequency
ratio between LEON and XPP. By writing this register, LEON software can set the XPP
clock relative to the LEON clock. This makes it possible to adapt the XPP clock frequency to
the required XPP performance and consequently to influence the power consumption of the
system. Writing zero to the XPP clock register turns off the XPP. Finally, there is a status
register for every FIFO containing the number of valid values currently available in the FIFO.

These status registers provide maximum flexibility in communication between LEON and
XPP and enable different communication modes:

If only one application is running on the system at a time, software may be
developed in pipeline hold mode. Here LEON initiates a data read from or write to the
XPP. If there is no value to read or no space for a value to be written, the LEON pipeline is
stopped until the read or write is possible. This can be used to reduce the power consumption
of the LEON part.

In interrupt mode, the XPP can influence the LEON program flow. Here, the IO Ports generate
an interrupt depending on the actual number of values available in the FIFO. The
communication between LEON and XPP is done in interrupt service routines.

Polling mode is the classical way to access the XPP. Initiated by a timer event,
LEON reads all XPP ports containing data and writes all XPP ports containing free FIFO
space. Between these phases LEON can perform other calculations.

It is possible at any time to switch between these strategies within one application.
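
Of the three modes, polling is the easiest to sketch in software. The following is a minimal illustration under stated assumptions: the `Port` class, the FIFO depths and the `poll` function are hypothetical, and a real implementation would use the extended read/write instructions rather than Python containers. One polling phase drains every port that holds data and fills every port that has free space.

```python
from collections import deque


class Port:
    """Toy FIFO port with a fixed, assumed depth."""

    def __init__(self, depth, initial=()):
        self.depth = depth
        self.fifo = deque(initial)

    def valid(self):
        return len(self.fifo)               # number of valid values available

    def free(self):
        return self.depth - len(self.fifo)  # remaining FIFO space


def poll(ports_to_xpp, ports_from_xpp, pending, results):
    """One polling phase, e.g. triggered by a timer event."""
    for port in ports_from_xpp:             # read all ports containing data
        while port.valid():
            results.append(port.fifo.popleft())
    for port in ports_to_xpp:               # write all ports containing free space
        while port.free() and pending:
            port.fifo.append(pending.popleft())


to_xpp = [Port(depth=2)]
from_xpp = [Port(depth=4, initial=[10, 11])]
pending = deque([1, 2, 3])
results = []
poll(to_xpp, from_xpp, pending, results)
```

Between two calls to `poll` the processor is free for other work, which is the trade-off the text describes: simple control flow at the cost of latency bounded by the timer period.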

The XPP is delivered with a configuration manager to handle configuration and
reconfiguration of the array. In this concept the configuration manager is dispensable
because the configuration as well as any reconfiguration is controlled by the LEON through
the XPP configuration register. All XPP configurations used for an application are stored in
the LEON's system RAM.

4. Tool and Compiler Integration

The LEON's SPARC V8 instruction set [1] was extended by a new subset of instructions to
make the new XPP registers accessible through software. These instructions are based on the
SPARC instruction format, but they do not conform to the SPARC V8 standard.
In keeping with the SPARC conventions of a load/store architecture, the instruction subset
can be split into two general types. Load/store instructions exchange data between the
LEON memory and the XPP communication registers. The number of cycles per instruction
is similar to that of the standard load/store instructions of the LEON. Read/write instructions
are used for communication between LEON registers. Since the LEON register set is extended



                        LEON alone      LEON with XPP    LEON with XPP    LEON with XPP
                                        in IRQ Mode      in Poll Mode     in Hold Mode

Configuration of XPP                    71.308 ns        81.361 ns        77.976 ns
                                        17.827 cycles    21.091 cycles    19.191 cycles

2D IDCT (8x8)           11.672 ns       3.272 ns         3.872 ns         3.568 ns
                        3.668 cycles    818 cycles       968 cycles       892 cycles

Table 2 Performance on IDCT (8x8)

by the XPP registers, the read/write instructions are also extended to access XPP registers.
Status registers can only be accessed with read/write instructions. Arithmetic
instructions cannot be executed on XPP registers; values have to be written to standard LEON
registers before they can be the target of arithmetic operations.

The complete system can still run any SPARC V8 compatible code. In this case, the XPP
is completely unused.

The LEON is provided with the LECCS cross compiler system [9], which is available under the
terms of the LGPL. This system consists of modified versions of binutils 2.11 and gcc 2.95.2.
To make the new instruction subset available to software developers, the assembler of the
binutils has been extended by a number of instructions according to the implemented
instruction subset. The new instructions have the same mnemonics as the regular SPARC V8
load, store, read and write instructions. Only the new XPP registers have to be used as source
or target operand. Since the modifications of LECCS are straightforward
extensions, the cross compiler system is backward compatible with the original version. The
availability of the source code of LECCS made it possible to extend the tools by the new XPP
operations in the described way.

The development of the XPP algorithms has to be done with separate tools, provided by
PACT Corp.

5. Application Results

As a first analysis application, an inverse DCT applied to an 8x8 pixel block was implemented.
For all simulations we used a 250 MHz clock frequency for the LEON processor and a 50 MHz
clock frequency for the XPP. Using the XPP accelerates the computation of the IDCT by about
a factor of four, depending on the communication mode. However, the XPP has to be configured
before the IDCT can be computed on it. Table 1 also shows the configuration time for this
algorithm. As shown in Figure 71, the benefit brought by the XPP rises with the number of IDCT
blocks computed by it before reconfiguration, so the number of reconfigurations during complex
algorithms should be minimised.
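
The trade-off between configuration time and per-block gain can be made concrete with a short calculation. This is a sketch only: it uses the cycle counts as they are printed in the performance table, reading the dot as a thousands separator, and it assumes that the three configuration figures belong to the IRQ, poll and hold columns in that order. Both readings are assumptions about the table and are not stated in the text.

```python
import math

LEON_IDCT_CYCLES = 3668  # 2D IDCT (8x8) on LEON alone, as printed

# mode -> (configuration cycles, IDCT cycles with XPP); column mapping is assumed
XPP_CYCLES = {
    "irq": (17827, 818),
    "poll": (21091, 968),
    "hold": (19191, 892),
}


def speedup(mode):
    """Per-block IDCT speedup, ignoring configuration time."""
    return LEON_IDCT_CYCLES / XPP_CYCLES[mode][1]


def cycles_per_block(mode, blocks):
    """Average cycles per block when one configuration is amortized over `blocks`."""
    config, idct = XPP_CYCLES[mode]
    return (config + blocks * idct) / blocks


def break_even_blocks(mode):
    """Smallest block count at which using the XPP is no slower than LEON alone."""
    config, idct = XPP_CYCLES[mode]
    return math.ceil(config / (LEON_IDCT_CYCLES - idct))
```

With these figures the per-block speedup lies between roughly 3.8 and 4.5, in line with the stated factor of about four, and a configuration pays off only after several blocks have been processed without reconfiguration.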

A first complex application implemented on the system is MPEG-1 decoding. The
optimization of the algorithm partitioning between LEON and XPP is still in progress.
Figure 8 shows the block diagram of the MPEG-1 decoding algorithm. Frames of 320 x
210 pixels were decoded. Using SPARC V8 standard instructions, LEON decodes one frame
in 23.16 seconds. In a first implementation of MPEG-1 using the XPP, only the IDCT is
computed by the XPP; the rest of the MPEG-1 decoding is still done by LEON. With the
help of the XPP, one frame is decoded in 17.98 s. This is a performance boost of more than
twenty percent. Since the performance gain obtained by accelerating only the IDCT algorithm
is still very low at the moment, we are working on XPP implementations of Huffman decoding,
dequantisation and prediction decoding. The performance boost of this concept over the
standalone LEON will thereby be increased.
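
The stated gain can be checked from the two frame times. The following lines are a plain arithmetic sketch using the figures as printed above; the variable names are chosen here for illustration.

```python
LEON_ONLY_S = 23.16  # seconds per frame, LEON alone (as printed)
WITH_XPP_S = 17.98   # seconds per frame, IDCT computed on the XPP (as printed)

time_saved = (LEON_ONLY_S - WITH_XPP_S) / LEON_ONLY_S  # fraction of decode time saved
overall_speedup = LEON_ONLY_S / WITH_XPP_S             # overall speedup factor
```

Both ways of expressing the gain, time saved and speedup factor, come out above twenty percent, consistent with the statement in the text.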

6. Conclusion

Today, the instruction datapaths of modern microprocessors reach their limits by using static
instruction sets, driven by the traditional von Neumann or Harvard architectural principles.
A way out of these limitations is a dynamically reconfigurable processor datapath extension
achieved by integrating traditional static datapaths with the coarse-grain dynamically
reconfigurable XPP architecture (eXtreme Processing Platform). To this end, a loose
asynchronous coupling mechanism for the given instruction datapath has been developed and
integrated onto a CMOS 0.13 µm standard cell technology from UMC. Here, the SPARC
compatible LEON RISC processor is used, and its static pipelined instruction datapath
has been extended so that it can be configured and personalized for specific applications. This
compiler-compatible instruction set extension allows varied and efficient use, e.g. in
streaming application domains like MPEG-4, digital filters, mobile communication
modulation, etc. The introduced coupling technique using flexible dual-clock FIFO interfaces
allows asynchronous concurrency and allows the frequency of the configured XPP datapath to
be adapted to actual performance requirements, e.g. to avoid unneeded cycles and to
reduce power consumption.

As presented above, the introduced concept combines the flexibility of a general-purpose
microprocessor with the performance and power consumption of coarse-grain reconfigurable
datapath structures, nearly comparable to ASIC performance. Here, two programming and
computing paradigms (control-driven von Neumann and transport-triggered XPP) are unified
within one hybrid architecture with the option of two clock domains. The ability to
reconfigure the transport-triggered XPP makes the system independent of standards or
specific applications. This concept opens up the potential to develop multi-standard
communication devices like software radios by using one extended processor architecture with
adapted programming and compilation tools. Thus, new standards can be easily implemented
through software updates. The system is scalable at design time through the scalable array
structure of the XPP extension used. This extends the range of suitable applications from
products with few multimedia functions to complex high-performance systems.

Second Major Part

Another aspect of the present invention will be described in the following. This aspect deals
with problems relating to the implementation of hyper-threading, multi-threading,
multi-tasking, scheduling and execution of parts of configurations and so forth.

It is noted that WO 00/19196 already discloses a method for execution of a computer
program, using a processor comprising a configurable functional unit capable of executing
reconfigurable instructions, the effect of which can be redefined at runtime.

A problem in conventional processor architectures exists if a coupling of, for example,
sequential processors is needed and/or technologies such as data streaming, hyper-threading,
multi-threading and so forth are to be used in a useful way that enhances performance.
Techniques known in the prior art, such as WO 02/50665 A1, do not allow for a sufficiently
efficient way of effecting data exchange between the ALU of a CPU and the configurable
data processing logic cell field, be it an FPGA, DSP or the like, that data exchange being
effected via registers in the prior art. In other words, it is necessary to first sequentially write
data into a register and then retrieve them sequentially and restore them sequentially as well.

Another problem exists if an external access to data is requested in known devices such as
those cited, which are used, inter alia, to implement functions in the configurable data
processing logic cell field, DFP, FPGA or the like that cannot be processed sufficiently on the
CPU-integrated ALU. Accordingly, the data processing logic cell field is practically used to
allow for user-defined opcodes that can process data more efficiently than is possible on the
ALU of the CPU without further support by the data processing logic cell field. In the prior
art, the coupling is generally word-based, not block-based. It is a further important aspect of
the present invention that it has been realized that, for data-streaming data processing,
block-based coupling is highly preferable. At any rate, a more efficient data processing, in
particular more efficient than possible with a close coupling via registers, is highly
preferable.

Another method for the use of logic cell fields consisting of coarse and/or fine granular logic
cells and logic cell elements consists in a very loose coupling of such a field to a
conventional CPU and/or a CPU core in embedded systems. Here, a conventional sequential
program can be executed on the CPU, for example a program written in C, C++ or the like,
wherein the instantiation of the data stream processing by the fine and/or coarse granular
data processing logic cell field is effected via that sequential program. The problem then
exists that, in programming said logic cell field, a program not written in C or any other
sequential high-level language must be provided for the data stream processing. It would be
preferable here to allow for C programs to run both on a conventional CPU architecture and
on the data processing logic cell field operated therewith; in particular, a quasi-sequential
program execution should maintain the capability of data streaming in the data processing
logic cell field, while simultaneously the capability exists to operate the CPU in a not too
loosely coupled way.

It is already known to provide for sequential data processing within a data processing logic
cell field; see, for example, DE 196 51 075, WO 98/26356, DE 196 51 816, WO 98/29952, DE
197 01 728, WO 98/35299, DE 199 26 538, WO 00/77652, DE 102 12 621. Here, partial
execution is achieved within a single configuration, for example to reduce the amount of
resources needed, to optimize the time of execution and so forth; however, this does not lead
automatically to allowing a programmer to translate or transfer high-level language code
automatically onto a data processing logic cell field, as is the case in common machine
models for sequential processes. The compilation, transfer or translation of high-level
language code onto data processing logic cell fields according to the methods known for
models of sequentially executing machines does in fact remain difficult.

In the prior art, it is further known that configurations that effect different functions on
respective parts of the area can be simultaneously executed on the processing array and that a
change of one or some of the configuration(s) without disturbing other configurations is
possible at run time.
Methods and hardware-implemented means for the implementation are known to ensure that
the execution of partial configurations to be loaded onto the array is possible without
deadlock. Reference is made to DE 196 51 593, WO 98/31102, DE 198 07 872, WO
99/11117, DE 199 26 538, WO 00/77652, DE 100 28 397, WO 02/13000. This technology
allows in a certain way a certain parallelism and, given certain forms and interrelations of
the configurations or partial configurations, a certain way of multitasking/multi-threading,
in particular in such a way that the planning, that is, the scheduling and/or the planning
control for time use, can be provided for. Furthermore, time-use planning control means and
methods are known per se from the prior art that, at least under a corresponding
interrelation of configurations and/or assignment of configurations to certain tasks and/or
threads to configurations and/or sequences of configurations, allow for multi-tasking and/or
multi-threading. The use of such time-use planning control means, used in the prior art for
configuring and management of configurations, for the purpose of scheduling tasks,
threads, and multi- and hyper-threads is considered to be inventive per se.

In at least a partial aspect of preferred variants, it is preferable to provide for support of
modern technologies of data processing and program execution such as multi-tasking,
multi-threading, hyper-threading and so forth.

In embodiments of the present invention, data are inputted into the data processing logic cell
field in response to the execution of a load-configuration by the data processing logic cell
field, and/or data are stored from the data processing logic cell field by executing a
store-configuration. Accordingly, it is preferred to provide the load- and/or
store-configurations in such a way that the addresses of those memory cells used are directly
or indirectly generated within the data processing logic cell field, the addresses indicating
those memory cells and/or locations to which an access has to be effected as a load- and/or
store-access, that is, as a read- and/or write-access. By configuring address generators within
the configuration, it becomes possible to load a plurality of data into the data processing
logic cell field, where they can be stored in internal RAMs (IRAM) and/or within the internal
cells such as EALUs having registers and/or in other dedicated memory and/or storage. The
load- or store-configuration, respectively, thus allows for a blockwise and thus almost
data-stream-like loading and storing of data, this being in particular much faster than a
single access, and can be executed prior to or during the execution of one or more
configurations that process the preloaded data, i.e. that handle the data in a data-altering
manner.
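
The role of the address generators in a load- or store-configuration can be sketched as follows. This is an illustrative model only: the function names, the linear base/stride addressing scheme and the use of Python lists for external memory and IRAM are assumptions made for the example. The point shown is that the addresses of a whole block are produced by the configuration itself, not by one CPU instruction per datum.

```python
def address_generator(base, count, stride=1):
    """Yield the memory addresses of one block (assumed linear addressing)."""
    for i in range(count):
        yield base + i * stride


def load_configuration(memory, iram, base, count, stride=1):
    """Blockwise load: copy `count` values from memory into the internal RAM."""
    for slot, address in enumerate(address_generator(base, count, stride)):
        iram[slot] = memory[address]


def store_configuration(memory, iram, base, count, stride=1):
    """Blockwise store: write `count` values from the internal RAM back to memory."""
    for slot, address in enumerate(address_generator(base, count, stride)):
        memory[address] = iram[slot]


memory = list(range(100, 132))  # stand-in for external memory
iram = [0] * 8                  # stand-in for one internal RAM
load_configuration(memory, iram, base=4, count=8, stride=2)
iram = [value + 1 for value in iram]  # stand-in for a data processing configuration
store_configuration(memory, iram, base=4, count=8, stride=2)
```

A strided generator such as the one above also covers simple gather and scatter patterns, for example reading one column of a row-major matrix.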

The data loading can take place, provided that the logic cell field is, as is typically the
case, sufficiently large, in small partial areas thereof, while other partial areas are executing
other tasks. For example, other published documents by PACT discuss a ping-pong-like data
processing that relies on memory cells provided on each side of the data
processing field, where, in a first processing step, data stream from the memory
on one side through the data processing field to the memory on the other side of the data
processing field. The data are stored there as intermediate results while, if
necessary, the array is reconfigured. The intermediate results then stream
back for further processing and so forth. Here, a memory strip on one side and/or
memory part on one side can be preloaded with data by a load-configuration in one array part,
while in the memory part on the other side of the logic cell field data are written out using
a store-configuration. Such a simultaneous load-/store-way of data
processing is possible even without spatial distribution and/or separation of memory areas in
which data are retrieved and/or in which data are stored.
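
The ping-pong scheme described above can be sketched in a few lines. This is a simplified model under stated assumptions: each array configuration is represented by a plain Python function applied element by element, the two memory sides are plain lists, and the name `ping_pong` is chosen here for illustration.

```python
def ping_pong(data, configurations):
    """Stream data back and forth between two memory banks through the array.

    Each entry of `configurations` stands in for one array configuration;
    the array is considered reconfigured between successive passes.
    """
    banks = [list(data), []]  # banks[0]: memory on one side, banks[1]: the other side
    source = 0
    for configure in configurations:
        target = 1 - source
        banks[target] = [configure(x) for x in banks[source]]  # one pass through the array
        banks[source] = []  # the source side is now free, e.g. for preloading
        source = target     # intermediate results stream back on the next pass
    return banks[source]


result = ping_pong([1, 2, 3], [lambda x: x + 1, lambda x: x * 2])
```

In the model, the emptied source bank marks the point at which a load-configuration could preload the next block while the other side is still being written out.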

It is possible to effect the data loading from a cache and/or into a cache. In one embodiment,
the external communication to large memory banks may be handled
via a cache controlling unit without having to provide for separate circuitry within the data
processing logic cell field. The access in a writing or reading manner to cache-memory-means
typically is very fast and has a small latency (if any). Also,
typically a CPU-unit is, for example, via a load-/store-unit, coupled to the
cache so that access to data and an exchange thereof between the CPU-core and the data
processing logic cell field can be effected quickly, block-wise, and such that not every
single datum needs to be transferred via a separate instruction that must be fetched, for
example, by the opcode-fetcher of the CPU and processed therein.

This cache-coupling may be much better than the
coupling of the data processing logic cell field to the ALU of the CPU via registers, if
those registers communicate only via a load-/store-unit with the cache, as is known in the
conventional case.

In an embodiment of the present invention, a further data
connection may be provided to and/or from the load-/store-unit of the, or one of the,
sequential CPU-units connected to the data processing logic cell field and/or
their registers.

It is possible to address units via separate input/output
ports of the data processing logic cell field, which can in particular be provided as a
VPU or XPP, and/or to address the data processing logic cells via one or more
multiplexers downstream of a single port.

Besides the blockwise and/or streaming and/or random mode access to cache areas in a
writing and a reading manner and/or to the load-/store-unit and/or the known connection to
the registers of a sequential CPU, in an embodiment of the present invention, a connection
is provided to an external mass memory such as a RAM, a hard disc, or any other data
exchange or input or output port such as an antenna, etc. In an embodiment, separate ports
may be provided for the access to several of such units and/or memory means. Suitable
drivers, signal conditioning circuitry, and so forth may accordingly be provided.
Furthermore, in particular, although not exclusively, for the handling of a data stream
streaming into the data processing logic cell field and/or out of the data processing logic
cell field, the logic cells of the field can include ALUs or EALUs, respectively, which can
have at their input- and/or output ports short, fine-granularly configurable FPGA-like
circuitries, for example, to cut out 4-bit-blocks out of a continuous data stream as is
necessary, for example, for an MPEG-4-decoding. This may be advantageous, for example, if a
data stream is to be input into the cell and is to be processed or preprocessed without
blocking larger PAE-units. In an embodiment of the present invention, the ALU may be
provided as an SIMD-ALU. For example, a very broad data word having, for example, a
32-bit-data-width may accordingly be split via an FPGA-like stripe in front of the SIMD-ALU
into eight data words having, for example, 4-bit-data-width that can then be processed in
parallel in the SIMD-ALU, increasing the overall performance of the system significantly,
provided that the respective applications require it.
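
The splitting of a 32-bit word into eight 4-bit words ahead of an SIMD-ALU can be
illustrated in software. The following is only a minimal sketch under stated assumptions:
the function names split_nibbles and simd_add4 are hypothetical, lane 0 is assumed to be the
least significant 4 bits, and a lane-wise wrapping addition merely stands in for whatever
operation the SIMD-ALU would actually perform.

```python
def split_nibbles(word):
    """Split an unsigned 32-bit word into eight 4-bit lanes (lane 0 = least significant)."""
    if not 0 <= word < (1 << 32):
        raise ValueError("expected an unsigned 32-bit word")
    return [(word >> (4 * lane)) & 0xF for lane in range(8)]


def simd_add4(a, b):
    """Add two 32-bit words lane by lane; each 4-bit lane wraps around independently."""
    lanes = [(x + y) & 0xF for x, y in zip(split_nibbles(a), split_nibbles(b))]
    # Reassemble the eight 4-bit results into one 32-bit word.
    return sum(lane << (4 * index) for index, lane in enumerate(lanes))
```

Because no carry crosses a lane boundary, the eight additions are independent of each
other, which is what allows them to be processed in parallel.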

Furthermore, it is noted that when reference is made to FPGA-like pre- or post-structures,
it is not absolutely necessary to refer to 1-bit-granular devices. Instead, it would be
possible to provide finer-granular structures of, for example, 4-bit, instead of the
hyper-fine-granular 1-bit structures. In other words, the FPGA-like input and/or output
structures, in front of or data downstream of the ALU-unit, in particular SIMD-ALU-units,
may be configurable in such a way that 1-bit-data-words are always processed. It is also
possible to provide for a cascading, so that, for example, incoming 32-bit-data-width words
are separated into 4-bit parts by 8-bit FPGA-like structures in sequence of each other,
then the four 8-bit data words are processed in four FPGA-like 8-bit-width structures, then
a second stripe of 8 separate 4-bit-wide FPGA-like structures is provided, and, if
necessary, sixteen separate parallel 2-bit FPGA-like structures, for example, are provided.
If this is the case, a significant reduction of the overhead compared to a
hyper-fine-granular 1-bit FPGA-like structure can be achieved. This may allow for
significantly reducing the configuration memory, etc., thus saving on silicon area.

It is noted that the coupling advantages may be achieved using data block streams via a
cache. However, it is preferred in particular if the cache is built slice-wise and if an
access onto several slices, and in particular onto all slices, can take place
simultaneously. It may be advantageous if the data processing logic cell field (XPP) and/or
the sequential CPU and/or CPUs process a plurality of threads, whether by way of
hyper-threading, multi-tasking, and/or multi-threading. It may also be preferable to
provide cache-storage means with slice access and/or slice access
enabling control means. For example, every single thread can be assigned a separate slice,
thereby allowing that on processing that thread the respective cache areas are accessed on
the re-entry of the group of codes to be processed. However, the cache need not necessarily
be separated into slices and, even if the cache is separated into slices, not every single
thread must be assigned a separate slice, although this may be a highly preferred method.
Furthermore, it is to be noted that there may be cases where not all cache areas are used
simultaneously or temporarily at a given time. Instead, it is to be expected that in
typical data processing applications, such as in hand-held mobile telephones, laptops,
cameras, etc., there may be periods during which not the entire cache is needed.
Accordingly, it may be highly advantageous that certain cache-areas can be separated from
the power source in such a way that the energy consumption is significantly reduced, in
particular, close to or exactly to 0. This can be achieved by a power supply separation
arrangement adapted to separate cache slices from power. The separation can either be
effected by a down-clocking, separation of clock-lines, and/or the overall separation of a
power supply. In particular, it may be possible to provide for such a separation for every
single cache slice, for example, by an access identification arrangement adapted to
identify whether or not a thread, hyper-thread, task, or the like is currently assigned to
a respective cache slice. In case the access identification arrangement indicates and/or
detects that this is not the case, there may be a separation of the slice from a clock-line
and/or even the power-line. On repowering-up after a separation from power, it is possible
to immediately access the cache area. Thus, no significant delay by switching ON or OFF of
the power is to be expected, as long as the hardware is implemented with current
semiconductor technologies.
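
The per-thread slice assignment and the separation of unassigned slices from power can be
modeled in a few lines. This is a hedged software sketch only: the class SlicedCache and its
methods are hypothetical names, one slice per thread is assumed, and the boolean "powered"
flag stands in for the clock-line or power-line separation performed in hardware.

```python
class SlicedCache:
    """Toy model: each slice has at most one owning thread and is powered only while owned."""

    def __init__(self, num_slices):
        self.owner = [None] * num_slices     # access identification: which thread owns a slice
        self.powered = [False] * num_slices  # stand-in for the power/clock separation

    def assign(self, thread_id):
        """Give the first free slice to a thread and power it up; return the slice index."""
        for index, owner in enumerate(self.owner):
            if owner is None:
                self.owner[index] = thread_id
                self.powered[index] = True
                return index
        raise RuntimeError("no free cache slice")

    def release(self, thread_id):
        """Detach a thread from its slices and separate those slices from power."""
        for index, owner in enumerate(self.owner):
            if owner == thread_id:
                self.owner[index] = None
                self.powered[index] = False

    def powered_slices(self):
        return sum(self.powered)
```

In this model a slice that no thread, hyper-thread, or task is assigned to draws no power,
and it becomes usable again as soon as it is reassigned.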

A further advantage in embodiments of the present invention is that, although the transfer
of data and/or operands is possible in a block-wise manner, no particular balancing is
needed to ensure that exactly the same times of execution of data processing steps in the
sequential CPU and the XPP and/or other data processing logic cell fields are achieved.
Instead, the processing may frequently be independent, in particular in such a way that the
sequential CPU and the data processing logic cell field can be considered as separate
resources by a scheduler. This allows for the immediate implementation of known data
processing program splitting technologies such as multi-tasking, multi-threading, and/or
hyper-threading. A result of a data path balancing not being necessary is that, for
example, in a sequential CPU a number of pipeline stages may be included, clock frequencies
and/or schemes of clocking may be achieved in a different way, etc. It is a particular
advantage if asynchronous logic is needed. In an embodiment of the present invention, by
configuring a load- and a store-configuration into the data processing logic cell fields,
the data inside the field can be loaded into that field or out of that field in a manner
which is not controlled by the clock frequency of the CPU, the performance of the opcode
fetcher, etc. In other words, the opcode fetcher does not bottleneck the data throughput to
the data logic cell field without having an only loose coupling.

In an example embodiment of the present invention, it is possible to use the known CT or CM
(commonly employed in the XPP-unit, also given the fact that with one or more, even
hierarchically arranged XPP-fields having in some embodiments their own CTs while
simultaneously using one or more sequential CPUs) as a quasi hyper-threading
hardware-management unit, which may have the advantage that known technologies, such as
FILMO and others, become applicable for the hardware support and management of
hyper-threading, etc. It is alternatively also possible, in particular in a hierarchical
arrangement, to provide the configurations from the opcode-fetcher of a sequential CPU via
the coprocessing interface, allowing for instantiation of an XPP and/or data processing
logic cell field call by the sequential CPU to effect data processing on the data
processing logic cell field. Cache coupling and/or load and/or store configurations
providing address generators for loading and/or storing of data into the data processing
logic cell field or out of that field may provide for the data exchange of the XPP. In
other words, the coprocessor-like coupling to the data processing logic cell field may be
enabled while, simultaneously, a data stream-like data loading is effected via cache-
and/or I/O-port coupling.

The method of coprocessor coupling, that is, the indicated coupling of the data processing
logic cell field, may typically result in the scheduling of the logic cell field taking
place on the sequential CPU and/or a supervising scheduler unit and/or a respective
scheduler means. In such a case, the threading control and/or management practically takes
place on the scheduler and/or the sequential CPU. Although this is possible, this will not
necessarily be the case where the easiest implementation of the invention is sought. The
data processing logic cell field can be called in a conventional manner, such as has been
the case in a standard coprocessor such as a combination of 8086/8087.

In one example embodiment, independent of its configuration, e.g., as a coprocessor
interface, the configuration manager acting as scheduler at the same time, or in any other
way, it is possible to address memory within or in an immediate vicinity of the data
processing logic cell fields or under its management, in particular memory within the
XPP-architecture, RAM-PAEs, etc. Accordingly, managing internal memories such as a vector
register may be advantageous. That is, the data volumes loaded via the load configuration
may be stored vector-like in vector registers in the internal-memory-cells, and thereafter
said registers may be accessed after loading and/or activating of a new configuration for
effecting the actual data processing. (It is noted that a data processing configuration can
be referred to as one configuration even in a case where several distinct configurations
are to be processed simultaneously, one after the other, or in a wave-like modus.)

A vector register can be used to store results and/or intermediate results in the internal
or internally managed memory cell elements. The vector register-like accessed memory means
in the XPP can be used also, after reconfiguration of the processing configuration by
loading a store configuration in a suitable manner, in a way that takes place again in a
data-stream-like manner, be it via an I/O-port directly streaming data into external memory
areas and/or into cache areas or out of these, which then can be accessed at a later stage
by the sequential CPU and/or other configurations executed on the other data processing
logic cell field, in particular in a data processing logic cell field having produced said
data in the first place.
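
The sequence of a load configuration, a data processing configuration working on the
vector register, and a store configuration can be sketched as three steps. The sketch below
is illustrative only and makes several assumptions: the function name run_configurations is
hypothetical, the cache is modeled as a plain list, the address generator is a simple
linear range, and the "kernel" argument stands in for the actual data processing
configuration.

```python
def run_configurations(cache, base, length, kernel):
    """Model load -> process -> store on a block of `length` words starting at `base`."""
    # Load configuration: an address generator streams the block into a vector register.
    vector_register = [cache[base + offset] for offset in range(length)]

    # Data processing configuration: operates only on the internally held vector register.
    vector_register = [kernel(value) for value in vector_register]

    # Store configuration: streams the results back to the same cache area.
    for offset, value in enumerate(vector_register):
        cache[base + offset] = value
    return cache
```

For example, run_configurations(list(range(8)), 2, 3, lambda x: x * 10) rewrites only the
three addressed words and leaves the rest of the modeled cache untouched.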

In one example embodiment, at least for certain data processing results and/or intermediate
results, for the memory and/or memory registers into which the processed data are to be
stored, not an internal memory, but instead a cache area having access reservation,
particularly cache areas which are organized in a slice-wise manner, can be used. This can
have the disadvantage of a larger latency, in particular if the paths between the XPP
and/or data processing logic cell fields to or from the cache are of considerable length
such that signal transmission delays need to be considered. Still, this may allow for
additional store configurations to be avoided. It is also noted that this way of storing
data in a cache area becomes, on the one hand, possible by placing the memory, into which
data are stored, physically close to the cache controller and embodying that memory as a
cache, but that alternatively and/or additionally the possibility exists to submit a part
of a data processing logic cell field memory area, internal memory, under the control of
one or several cache-memory controller(s), e.g., in the "RAM over PAE" case.

This may be advantageous if the latency in storing data processing results is to be kept
small, while latency in accessing the memory area serving as a quasi-cache to other units
will not be too significant in other cases.

In an embodiment of the present invention, the cache controller of the known sequential CPU
may address as a cache a memory area that, without serving for the purpose of data exchange
with a data processing logic cell field, is physically placed onto that data processing
logic cell field and/or close to that field. This may be advantageous in that, if
applications are run on the data processing logic cell fields having a very small local
memory need and/or if only few other configurations compared to the overall amount of
memory space available are needed, these memory areas can be assigned to one or more
sequential CPUs as cache or additional cache. In such a case the cache controller may be
adapted for the management of a cache area having a dynamically varying size.

A dynamic cache-size management and/or dynamic cache management size means for the dynamic
cache management may particularly take into account the work load on the sequential CPU
and/or the data processing logic cell fields. In other words, so as to enable fast
reconfiguration (whether by way of wave-reconfiguration or in any other way), how many NOPs
in a given time unit are executed on the sequential CPU and/or how many configurations are
preloaded in the dynamically reconfigurable field in the memory areas provided therefor may
be analyzed. The dynamic cache size or cache size management disclosed herein may be
runtime dynamic. That is, the cache controller may control a momentary cache size that can
be changed from clock-cycle to clock-cycle or from one group of clock-cycles to another. It
is also noted that the access management of a data processing logic cell field with access
as internal memory, such as vector register, is possible. While, as discussed above, a
configuration management unit can be provided, it is noted that such units and their way of
operation, allowing in particular the preloading of configurations currently not yet
needed, can be used very easily to effect the multi-task operation and/or hyper-threading
and/or multi-threading, in particular for task- and/or thread- and/or hyper-thread
switches. During the runtime of a thread or a task, it is possible to preload
configurations for different tasks and/or threads and/or hyper-threads into the PAE-array.
This may allow for a preload of configurations for a different task and/or thread if the
current thread or task cannot be executed, for example because data are awaited, whether
where they have not yet been received, for example due to latencies, or where a resource is
blocked by another access. In case of the configuration preloading for a different task or
thread, a switch or change becomes possible without the disadvantage of a timing overhead
due to the, for example, shadow-like loaded configuration execution.
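
How a cache controller might derive a momentary cache size from the work load can be
sketched as follows. The text names the inputs (NOPs executed on the sequential CPU per time
unit and the number of preloaded configurations) but not a concrete policy, so the policy
below is purely an assumption made for illustration: the function name cpu_cache_banks, the
unit "banks", the parameter banks_per_config, and the idle threshold are all hypothetical.

```python
def cpu_cache_banks(total_banks, preloaded_configs, banks_per_config,
                    nop_ratio, idle_threshold=0.5):
    """Return how many field-local memory banks to lend to the CPU as additional cache.

    Assumed policy (not taken from the source text):
      * banks holding preloaded configurations are never lent out, so that fast
        reconfiguration stays possible;
      * if the CPU mostly executes NOPs it does not need extra cache, so nothing is lent.
    """
    reserved = min(total_banks, preloaded_configs * banks_per_config)
    free = total_banks - reserved
    if nop_ratio >= idle_threshold:
        return 0
    return free
```

Such a function could be re-evaluated every clock cycle or every group of clock cycles,
which corresponds to the runtime-dynamic cache size mentioned above.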

It is in principle possible to use this technique also in cases where the most likely
continuation of an execution is predicted and a prediction is missed. However, this way of
operation may be particularly advantageous in cases free of predictions. When using a pure
sequential CPU and/or several pure sequential CPUs, the configuration manager thus also
acts as and realizes a hyper-threading management hardware. It can be considered as
sufficient, in particular in case where the CPU and/or several sequential CPUs have a
hyper-threading management, to keep partial circuitry elements such as the FILMO discussed
in DE 198 07 872, WO 99/44147, and WO 99/44120. In an embodiment of the present invention,
the configuration manager discussed in these documents with and/or without FILMO may be
provided for use with the hyper-threading management for one and/or more purely sequential
CPUs with or without coupling to a data processing logic cell field.

It is noted that the plurality of CPUs can be realized with known techniques, such as those
discussed in DE 102 12 621 and PCT/EP 02/10572. It is also noted that DE 196 51 075,
DE 196 54 846, DE 197 04 728, WO 98/26356, WO 98/29952, and WO 98/35299 discuss how to
implement sequencers having ring- and/or random-access memory means in data processing
logic cell fields.

It is noted that a task-, thread-, and/or hyper-thread switch can be effected with the
known CT-technology such that performance-slices and/or time-slices are assigned to a
software implemented operating system scheduler by the CT, during which slices it is
determined which parts of tasks and/or threads are subsequently to be executed provided
that resources are available.

The following is an example. First, an address sequence is generated for a first task
during which the execution of a load configuration loads data from a cache memory coupled
to the data processing logic cell field in the described manner. As soon as the data are
present, the execution of a second configuration, the actual data-processing configuration,
can be started. This configuration can be preloaded as well, since it is certain that this
configuration is to be executed provided that no interrupts or the like cause task
switches. In conventional processes there is the known problem of the so-called cache-miss,
where data are requested that are not yet available in the cache. If such a case occurs in
the coupling according to embodiments of the present invention, it is possible to switch
over to another thread, hyper-thread, and/or task that has in particular been previously
determined as the one to be executed next, in particular by the software implemented
operating system scheduler and/or another hard- and/or software implemented unit operating
accordingly, and that has thus been preloaded in an available configuration memory of the
data processing logic cell field, in particular preloaded in the background during the
execution of another configuration, for example the load configuration which has effected
the loading of data that are now awaited.

It is possible to provide for separate configuration lines, these being, e.g., separate
from communication lines used in the connection of, in particular, the coarse-granular data
processing logic cells of the data processing logic cell field. Then, if the configuration
to which, due to the task-, thread-, and/or hyper-thread-switch, processing has been
switched over, has been executed, and in particular has been, in the preferable
non-dividable, uninterruptable, and hence quasi-atomic configuration, executed until its
end, a further other configuration as predetermined by that scheduler, particularly said
operating system-like scheduler, and/or a configuration for which the assigned load
configuration has been executed may be executed. Prior to the execution of a processing
configuration for which a load configuration has been executed previously, a test can be
performed to determine whether or not the respective data have been streamed into the
array, e.g., checking if the latency time which typically occurs has lapsed and/or the data
are actually present.

In other words, latency times which occur as configurations are not yet preloaded, data
have not yet been loaded, and/or data have not yet been stored, are bridged and/or covered
by executing threads, hyper-threads, and/or tasks which have been preconfigured and which
process data that are already available or can be written to resources that are available
for writing thereto. In this way, latency times are covered and/or bridged and, provided a
sufficient number of threads, hyper-threads, and/or tasks are to be executed, the data
processing logic cell field can have an almost 100% load.
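
The effect of bridging latency by switching to a preconfigured task whose data are already
present can be shown with a small simulation. This is a simplified sketch under stated
assumptions: the function name simulate is hypothetical, each task is reduced to a pair
(ready_time, run_cycles), configurations run to their end without interruption, and switch
overhead is taken as zero because the next configuration is assumed to be preloaded.

```python
def simulate(tasks, switch_on_stall):
    """Return (total_cycles, busy_cycles) for tasks given as (ready_time, run_cycles)."""
    pending = list(tasks)
    time = busy = 0
    while pending:
        ready = [task for task in pending if task[0] <= time]
        if switch_on_stall and ready:
            task = ready[0]        # switch to a task whose data are already present
        elif switch_on_stall:
            task = min(pending)    # nothing is ready: wait for the earliest data
        else:
            task = pending[0]      # strict program order, stall until data arrive
        pending.remove(task)
        time = max(time, task[0]) + task[1]   # wait if needed, then run to completion
        busy += task[1]
    return time, busy
```

With tasks = [(4, 2), (0, 3), (0, 1)], strict ordering stalls for four cycles on the first
task, whereas switching fills those cycles with the two tasks whose data are available, so
that every cycle of the modeled field is used.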

In embodiments of the present invention, it is possible to realize a real time system
despite the coupling of the array to a sequential CPU, in particular, while still having a
data stream capability. In order to ensure real time capabilities it must be guaranteed
that incoming data or interrupts signaling incoming data are reacted upon without exceeding
an allowed maximum time. This can be effected by causing a task switch on an interrupt
and/or, for example, if the interrupts have a certain priority, by determining that a
certain interrupt is currently to be ignored, which has to be determined within a certain
time as well. A task switch in such systems capable of real time processing will thus
typically be possible in one of three instances, which are when a task has run for a
certain time (watch dog-principle), at non-availability of a resource, whether due to a
blockade, due to another access, or due to latencies, and/or at the occurrence of
interrupts.

A way of implementing one of these variants may ensure the real time capability. In a first
alternative, one resource which is under the control of the CT or scheduler switches over
to processing the interrupt. If the allowed response time to a certain interrupt is so long
that the configuration currently configured can be executed without interruption, this is
uncritical, particularly in view of the fact that the interrupt handling configuration can
be preloaded. The selection of the interrupt handling configuration to be preloaded can be
carried out by the CT or in any other way. It is also possible to restrict the runtime of
the configuration on the resource to which the interrupt processing has been assigned.
Regarding this, see PCT/DE 03/000942.

If the system has to react faster if an interrupt occurs, in one embodiment, a single
resource, for example, a separate XPP-unit or parts of a data processing logic cell field,
may be reserved for the execution of interrupt handling routines. In this case, it is also
possible to preload interrupt handling routines for interrupts that are particularly
critical. It is also possible to immediately start loading of an interrupt handling routine
once the interrupt occurs. The selection of the configuration necessary for a respective
interrupt can be effected by triggering, wave-processing, etc.

By the methods described, it becomes possible to provide for an instantaneous reaction to
the interrupt by using load-/store configurations in order to obtain a code-reentrancy.
Following every single or every other data processing configuration, for example every five
or ten data processing configurations, a store configuration may be executed and then a
load configuration accessing the very memory arrays in which data have just been written
may be carried out. Then, that the memory areas used by the store configuration remain
untouched has to be ensured until the configuration or group of configurations for which
the preloading has been effected has been finished by completely executing a further store
configuration. In this way of intermediately carried out load-/store configurations and
simultaneous protection of not yet outdated store-memory areas, code-reentrancy is
generated very easily, for example in compiling a program. Here, resource reservation may
be advantageous as well.
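
The interleaving of store and load configurations to obtain re-entrancy resembles a
checkpointing scheme, which can be sketched as below. This is an illustrative model only,
with hypothetical names: run_with_checkpoints, the dictionary used as stored state, and the
interrupt_at parameter are assumptions, and each configuration is modeled as a function
that returns a new state.

```python
def run_with_checkpoints(configs, state, every, interrupt_at=None):
    """Run configurations in order, storing the state after every `every` configurations.

    Returns (state, next_index). On an interrupt, the last stored state and the index at
    which execution can be re-entered are returned; the stored copy is left untouched
    until a further store has completed.
    """
    checkpoint, resume_from = dict(state), 0
    for index, config in enumerate(configs):
        if index == interrupt_at:
            return dict(checkpoint), resume_from   # re-entry point after the interrupt
        state = config(state)
        if (index + 1) % every == 0:
            # Store configuration completed: this copy is now the protected memory area.
            checkpoint, resume_from = dict(state), index + 1
    return state, len(configs)
```

After an interrupt, calling the function again on configs[next_index:] with the returned
state reproduces the result of an uninterrupted run in this model.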

Further, in one example embodiment of the present invention, a reaction to an interrupt may
include using interrupt routines where code for the data processing logic cell field is
forbidden. This embodiment may be particularly suited for an instance where one of the
resources available is a sequential CPU. In other words, an interrupt handling routine is
executed only on a sequential CPU without calling data processing steps or routines making
use of a data processing logic cell field. This may guarantee that the processing on the
data processing logic cell field is not to be interrupted and then, further processing on
the data processing logic cell field can be effected following a task switch. Although the
actual interrupt routine does not include any data processing logic cell field code such as
XPP-code, it can still be ensured that, at a later time no more relevant to real time
processing capabilities, the data processing logic cell field reacts to an interrupt and/or
a real time request determined, to state, information and/or data using the data processing
logic cell field.

Third major part

Abstract: Nowadays, the datapaths of modern microprocessors reach their limits by using
static instruction sets. A way out of this limitation is a dynamic reconfigurable processor
datapath extension achieved by integrating traditional static datapaths with the
coarse-grain dynamic reconfigurable XPP architecture (eXtreme Processing Platform).
Therefore, a loosely asynchronous coupling mechanism of the corresponding datapath units
has been developed and integrated onto a CMOS 0.13 µm standard cell technology from UMC.
Here the SPARC compatible LEON processor is used, whereas its static pipelined instruction
datapath has been extended to be configured and personalized for specific applications.
This allows a various and efficient use, e.g., in streaming application domains like MPEG 1,
digital filters, mobile communication modulation, etc. The chosen coupling technique allows
asynchronous concurrency of the additionally configured compound instructions, which are
integrated into the programming and compilation environment of the LEON processor.

1 Introduction

Fourth major part
1 Introduction

Compiling an HLL Subset Extended by Port Access Functions to an RDFP 
The following describes a method, according to an embodiment of the present invention, for
compiling a subset of a high-level programming language (HLL), e.g., C or FORTRAN, extended
by port access functions, to a reconfigurable data-flow processor (RDFP). The program may
be transformed to a configuration of the RDFP.

This method can be used as part of an extended compiler for a hybrid architecture including
a standard host processor and a reconfigurable data-flow coprocessor. The extended compiler
handles a full HLL, e.g., standard ANSI C. It maps suitable program parts, such as inner
loops, to the coprocessor and the rest of the program to the host processor. It is also
possible to map separate program parts to separate configurations. However, these
extensions are not the subject of the discussion below.

2 Compilation Flow 
The compilation method may include a frontend phase, a control/dataflow graph generation
phase, and a configuration code phase.

2.1 Frontend

The compiler may use a standard frontend which translates the input program (e.g., a C
program) into an internal format including an abstract syntax tree (AST) and symbol tables.
The frontend may also perform well-known compiler optimizations, e.g., constant
propagation, dead code elimination, common subexpression elimination, etc. For details
regarding this, see A.V. Aho, R. Sethi, and J.D. Ullman, "Compilers - Principles,
Techniques, and Tools," Addison-Wesley 1986. The SUIF compiler is an example of a compiler
providing such a frontend. Regarding the SUIF compiler, see The Stanford SUIF Compiler
Group Homepage at http://suif.stanford.edu.
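
One of the frontend optimizations mentioned above, the folding of constant subexpressions
on an abstract syntax tree, can be illustrated briefly. The sketch is an assumption-laden
stand-in: it uses Python's standard ast module on Python expressions rather than a C
frontend such as SUIF, the names ConstantFolder and fold are hypothetical, only the
operators +, -, and * are handled, and ast.unparse requires Python 3.9 or later.

```python
import ast
import operator

# Binary operators this sketch knows how to evaluate at compile time.
_OPS = {ast.Add: operator.add, ast.Sub: operator.sub, ast.Mult: operator.mul}


class ConstantFolder(ast.NodeTransformer):
    """Replace binary operations on two constants by a single constant node."""

    def visit_BinOp(self, node):
        self.generic_visit(node)  # fold the operands first (bottom-up)
        op = _OPS.get(type(node.op))
        if op and isinstance(node.left, ast.Constant) and isinstance(node.right, ast.Constant):
            folded = ast.Constant(op(node.left.value, node.right.value))
            return ast.copy_location(folded, node)
        return node


def fold(source):
    """Parse an expression, fold its constant subexpressions, and return it as text."""
    tree = ConstantFolder().visit(ast.parse(source, mode="eval"))
    return ast.unparse(ast.fix_missing_locations(tree))
```

For instance, fold("a + 2 * 3") yields "a + 6", so that a later mapping phase sees one
operation fewer.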

2.2 Control/Dataflow Graph Generation

Next, the program may be mapped to a control/dataflow graph (CDFG) including connected RDFP
functions. This phase is discussed in more detail below.

2.3 Configuration Code Generation

Finally, the last phase may directly translate the CDFG to configuration code used to
program the RDFP. For PACT XPP™ Cores, the configuration code may be generated as an NML
(Native Mapping Language) file.

3 Configurable Objects and Functionality of an RDFP

A possible implementation of the RDFP architecture is a PACT XPP™ Core. Discussed herein
are only the minimum requirements for an RDFP for this compilation method to work. The only
data types considered are multi-bit words called data and single-bit control signals called
events. Data and events are always processed as packets. See that which is discussed below
under the heading "Packet-based Communication Network." Event packets are called 1-events
or 0-events, depending on their bit-value.
3.1 Configurable Objects and Functions

An RDFP includes an array of configurable objects and a communication network. Each object
can be configured to perform certain functions, such as those listed below. It may perform
the same function repeatedly until the configuration is changed. The array need not be
completely uniform, i.e., not all objects need to be able to perform all functions. For
example, a RAM function can be implemented by a specialized RAM object that cannot perform
any other functions. It is also possible to combine several objects to a "macro" to realize
certain functions. For example, several RAM objects can be combined to obtain a RAM
function with larger storage.
Fig. 54 is a graphical representation of functions for processing data and event
packets that can be configured into an RDFP. The functions are as follows.

• ALU[opcode]: ALUs perform common arithmetic and logical operations on
data. ALU functions ("opcodes") must be available for all operations used in the
HLL. Otherwise, programs including operations that do not have ALU opcodes in
the RDFP must be excluded from the supported HLL subset or substituted by "macros" of
existing functions. ALU functions have two data inputs, A and B, and one data output,
X. Comparators have an event output U instead of the data output. They produce a
1-event if the comparison is true, and a 0-event otherwise.

• CNT: A CNT is a counter function which has data inputs LB, UB, and INC (lower bound,
upper bound, and increment) and data output X (counter value). A packet at event input
START starts the counter, and event input NEXT causes the generation of the next output
value (and output events) or causes the counter to terminate if UB is reached. If NEXT is
not connected, the counter may count continuously. The output events U, V, and
W have the following functionality. For a counter counting N times, N-1 0-events and
one 1-event may be generated at output U. At output V, N 0-events may be
generated, and at output W, N 0-events and one 1-event may be created. The 1-event
at W is only created after the counter has terminated, i.e., a NEXT event packet was
received after the last data packet was output.

• RAM[size]: The RAM function may store a fixed number of data words ("size"). It
has a data input RD and a data output OUT for reading at address RD. Event output ERD
signals completion of the read access. For a write access, data inputs WR and IN (address
and value) and data output OUT may be used. Event output EWR signals completion of
the write access. ERD and EWR always generate 0-events. Note that external RAM can
be handled as RAM functions exactly like internal RAM.

• GATE: A GATE may synchronize a data packet at input A and an
event packet at input E. When both inputs have arrived, they may both be consumed.
The data packet may be copied to output X, and the event packet to output U.

• MUX: A MUX function may have 2 data inputs, A and B, an event input, SEL, and a
data output, X. If SEL receives a 0-event, input A may be copied to output X
and input B may be discarded. For a 1-event, B may be copied and A may be
discarded.

• MERGE: A MERGE function may have 2 data inputs, A and B, an event input SEL,
and a data output X. If SEL receives a 0-event, input A may be copied to output X, but
input B is not discarded. The packet may be left at the input B instead. For a 1-event, B
may be copied and A left at the input.

• DEMUX: A DEMUX function may have one data input A, an event input SEL, and
two data outputs X and Y. If SEL receives a 0-event, input A may be copied to output
X, and no packet is created at output Y. For a 1-event, A may be copied to Y, and no
packet is created at output X.

• MDATA: An MDATA function may replicate data packets. It may
have a data input A, an event input SEL, and a data output X. If SEL receives a 1-event,
a data packet at A may be consumed and copied to output X. For all subsequent
0-events at SEL, a copy of the input data packet may be produced at the output
without consuming new packets at A. Only if another 1-event arrives at SEL, the next
data packet at A may be consumed and copied. It is noted that this can be
implemented by a MERGE with special properties on XPP™.

• INPORT[name]: An INPORT function may receive data packets from outside
the RDFP through input port "name" and may copy them to data output X. If a
packet was received, a 0-event may be produced at event output U, too. (It is noted
that this function can only be configured at special objects connected to external busses.)

• OUTPORT[name]: An OUTPORT function may send data packets received at data
input A to the outside of the RDFP through output port "name." If a packet was sent, a
0-event may be produced at event output U, too. (It is noted that this function can
only be configured at special objects connected to external busses.)
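
The difference between MUX, MERGE, and DEMUX lies in which packets are consumed. The following Python sketch models only that packet behaviour, under the assumption that each input is a FIFO queue and that 0-events and 1-events are the integers 0 and 1. The function names and the queue model are illustrative and are not part of the RDFP.

```python
from collections import deque

def mux(a, b, sel):
    """MUX: consume one packet from both A and B; forward the selected one."""
    x = deque()
    for s in sel:
        pa, pb = a.popleft(), b.popleft()  # both inputs are consumed
        x.append(pb if s else pa)          # the unselected packet is discarded
    return x

def merge(a, b, sel):
    """MERGE: consume only the selected input; the other packet stays queued."""
    x = deque()
    for s in sel:
        x.append(b.popleft() if s else a.popleft())
    return x

def demux(a, sel):
    """DEMUX: route each packet at A to X (0-event) or to Y (1-event)."""
    x, y = deque(), deque()
    for s in sel:
        (y if s else x).append(a.popleft())
    return x, y
```

With A = [1, 2], B = [10, 20], and SEL = [0, 1], mux yields [1, 20] and empties both inputs, while merge yields [1, 10] and leaves 2 at A and 20 at B.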

Additionally, the following functions manipulate only event packets: 

• 0-FILTER, 1-FILTER: A FILTER may have an input E and an output U. A
0-FILTER may copy a 0-event from E to U, but 1-events at E are discarded. A
1-FILTER may copy 1-events and discard 0-events.

• INVERTER: An INVERTER may copy all events from input E to output U but
invert their values.

• 0-CONSTANT, 1-CONSTANT: 0-CONSTANT may copy all events from input E to output U,
but may change them all to value 0. 1-CONSTANT may change them all to value 1.

• ECOMB: ECOMB may combine two or more inputs E1, E2, E3 ..., producing
a packet at output U. The output may be a 1-event if and only if one or more of the
input packets are 1-events (logical or). A packet must be available at all inputs before an
output packet is produced. It is noted that this function may be implemented by the
EAND operator on the XPP™.

• ESEQ[seq]: An ESEQ may generate a sequence "seq" of events, e.g.,
"0001," at its output U. If it has an input START, one entire sequence may be
generated for each event packet arriving at U. The sequence is only repeated if the next
event arrives at U. However, if START is not connected, ESEQ may constantly
repeat the sequence.

It is noted that the ALU, MUX, DEMUX, GATE and ECOMB functions may behave
like their equivalents in conventional dataflow machines.
In this regard, see A.H. Veen, "Dataflow Architecture," ACM Computing Surveys, 18(4)
(December 1986); and S.J. Allan & A.E. Oldehoeft, "A Flow Analysis Procedure for the
Translation of High-Level Languages to a Data Flow Language," IEEE Transactions on
Computers, C-29(9):826-831 (September 1980).

Packet-based Communication Network

The communication network of an RDFP can connect an output of one object (i.e., its
respective function) to the input(s) of one or several other objects. This is usually achieved
by busses and switches. By placing the functions properly on the objects, many functions can
be connected arbitrarily up to a limit imposed by the device size. As mentioned above, all
values may be communicated as packets. A separate communication network may
exist for data and event packets. The packets may synchronize the functions as in a dataflow
machine with acknowledge. In this regard, see A.H. Veen, supra. That is, the
function only executes when all input packets are available (apart from the non-strict
exceptions as described above). The function may also stall if the last output
packet has not been consumed. Therefore, a data-flow graph mapped to an RDFP may
self-synchronize its execution without the need for external control. Only if two or
more function outputs (data or event) are connected to the same function input ("N to 1
connection") is the self-synchronization disabled. It is noted that on XPP™ Cores, an
"N to 1 connection" for events is realized by the EOR function and, for data, by just
assigning several outputs to an input. The user has to ensure that only one packet arrives at a
time in a correct CDFG. Otherwise, a packet might get lost, and the value resulting from
combining two or more packets is undefined. However, a function output can be connected
to many function inputs ("1 to N connection") without problems.
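
The self-synchronizing firing rule can be illustrated with a short Python sketch. It assumes a hypothetical single-slot connection model (the Port class) and covers only strict functions, i.e., those that need all input packets. It is not a description of the actual XPP™ hardware protocol.

```python
class Port:
    """A connection that holds at most one packet (hypothetical model)."""
    def __init__(self, packet=None):
        self.packet = packet

def try_fire(op, inputs, output):
    """Execute op once if every input holds a packet and the output is free."""
    if any(p.packet is None for p in inputs):
        return False            # wait: an input packet is missing
    if output.packet is not None:
        return False            # stall: last output packet not yet consumed
    args = [p.packet for p in inputs]
    for p in inputs:
        p.packet = None         # consume the input packets
    output.packet = op(*args)
    return True
```

A function modelled this way needs no external controller: it simply does nothing until its packets are present and its previous result has been taken.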

There are some special cases:

• A function input can be preloaded with a distinct value during configuration. This packet
may be consumed like a normal packet coming from another object.

• A function input can be defined as constant. In this case, the packet at the input may be
reproduced repeatedly for each function execution.

An RDFP may require register delays in the dataflow. Otherwise, very long
combinational delays and asynchronous feedback are possible. It is assumed that
delays are inserted at the inputs of some functions (like for most ALUs) and in some routing
segments of the communication network. It is noted that registers may change the
timing, but not the functionality, of a correct CDFG.

Configuration Generation

Language Definition

The following HLL features are not supported by the method described herein:

• pointer operations

• library calls, operating system calls (including standard I/O functions)

• recursive function calls (Note that non-recursive function calls can be eliminated by
function in-lining and therefore are not considered herein.)

All scalar data types may be converted to type integer. Integer values may be
equivalent to data packets in the RDFP. Arrays (possibly multi-dimensional)
are the only composite data types considered.

The following additional features are supported:

INPORTS and OUTPORTS can be accessed by the HLL functions getstream(name, value)
and putstream(name, value), respectively.

Mapping of High-Level Language Constructs 

This method may convert an HLL program to a CDFG including the
RDFP functions defined in the discussion under the heading "Configurable
Objects and Functions." Before the processing starts, all HLL program arrays may be
mapped to RDFP RAM functions. An array x may be mapped to RAM RAM(x). If several
arrays are mapped to the same RAM, an offset may be assigned, too. The RAMs may be
added to an initially empty CDFG. There must be enough RAMs of sufficient size for all
program arrays.
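
The array-to-RAM mapping with offsets can be sketched as follows. The first-fit placement policy, the function name map_arrays, and the representation of RAMs as a list of sizes are assumptions of this sketch. The description above only requires that every array receives a RAM and, where RAMs are shared, an offset.

```python
def map_arrays(arrays, ram_sizes):
    """Assign each array (name -> length) a RAM index and a word offset.

    First-fit placement is an assumption of this sketch, not a stated rule.
    """
    used = [0] * len(ram_sizes)      # words already occupied in each RAM
    placement = {}
    for name, length in arrays.items():
        for ram, size in enumerate(ram_sizes):
            if used[ram] + length <= size:
                placement[name] = (ram, used[ram])   # (RAM index, offset)
                used[ram] += length
                break
        else:
            # There must be enough RAMs of sufficient size for all arrays.
            raise ValueError(f"no RAM large enough for array {name!r}")
    return placement
```

For arrays x, y, z of 100, 50, and 200 words and RAMs of 128 and 256 words, x is placed in the first RAM at offset 0, and y and z share the second RAM at offsets 0 and 50.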

The CDFG may be generated by a traversal of the AST of the HLL program. It
may process the program statement by statement and descend into the
loops and conditional statements as appropriate. The following two pieces of information
may be updated at every program point (which refers to a point between two
statements or before the beginning or after the end of a program component such as a loop or
a conditional statement) during the traversal:

• START may point to an event output of an RDFP function. This output
may deliver a 0-event whenever the program execution reaches this program
point. At the beginning, a 0-CONSTANT preloaded with an event input may be added
to the CDFG. (It may deliver a 0-event immediately after configuration.)
START may initially point to its output. This event may be used to start the
overall program execution. A START_new signal generated after a program part has
finished executing may be used as new START signal for the following program parts,
or it may signal termination of the entire program. The START events may
guarantee that the execution order of the original program is maintained wherever the
data dependencies alone are not sufficient. This scheduling scheme may be similar to a
one-hot controller for digital hardware.

• VARLIST may be a list of {variable, function-output} pairs. The pairs may map
integer variables or array elements to a CDFG function's output. The first pair for a
variable in VARLIST may contain the output of the function which produces the
value of this variable valid at the current program point. New pairs may be always
added to the front of VARLIST. The expression VARDEF(var) may refer to the
function-output of the first pair with variable var in VARLIST. With respect to this
way of using a VARLIST, see D. Galloway, "The transmogrifier C hardware description
language and compiler for FPGAs," Proc. FPGAs for Custom Computing Machines,
IEEE Computer Society Press, 1995, at 136-44.
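
The VARLIST bookkeeping amounts to a front-extended association list. The following minimal Python sketch uses tuples for the {variable, function-output} pairs. The helper names define and vardef, and the output labels such as "I", are illustrative only.

```python
def vardef(varlist, var):
    """VARDEF(var): function output of the first pair for var, or None."""
    for name, output in varlist:
        if name == var:
            return output
    return None

def define(varlist, var, output):
    """Return a new VARLIST with the pair {var, output} added to the front."""
    return [(var, output)] + varlist
```

Because new pairs are added to the front and lookups stop at the first match, an older definition of the same variable remains in the list but is shadowed.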

Below are systematically listed HLL program components and descriptions of how they
may be processed, thereby altering the CDFG, START, and VARLIST.

Integer Expressions and Assignments

Straight-line code without array accesses can be directly mapped to a data-flow graph. One
ALU may be allocated for each operator in the program. Because of the
self-synchronization of the ALUs, no explicit control or scheduling is needed. Therefore,
processing these assignments does not access or alter START. The data
dependencies (as they would be exposed in the DAG representation of the
program, in regard to which see A.V. Aho et al., supra) may be analyzed through the
processing of VARLIST. These assignments may synchronize themselves through the
dataflow. The data-driven execution may automatically exploit the available instruction
level parallelism.

All assignments may evaluate the right-hand side (RHS) or source expression. This
evaluation may result in a pointer to a CDFG object's output (or pseudo-object as
defined below). For integer assignments, the left-hand side (LHS) variable or destination
may be combined with the RHS result object to form a new pair {LHS, result(RHS)} which
may be added to the front of VARLIST.

For the following examples, C syntax is used. The simplest statement may be a constant
assigned to an integer:

a = 5;

It does not change the CDFG, but adds {a, 5} to the front of VARLIST. The
constant 5 is a "pseudo-object" which only holds the value but does not refer to a
CDFG object. Now VARDEF(a) equals 5 at subsequent program points before a is
redefined.

Integer assignments can also combine variables already defined and constants:

b = a*2 + 3;

In the AST, the RHS is already converted to an expression tree. This tree may be
transformed to a combination of old and new CDFG objects (which are added to the CDFG)
as follows. Each operator (internal node) of the tree may be substituted by an ALU with
the opcode corresponding to the operator in the tree. If a leaf node is a constant, the ALU's
input may be directly connected to that constant. If a leaf node is an integer variable var, it
may be looked up in VARLIST, i.e., VARDEF(var) is retrieved. Then VARDEF(var) (an
output of an already existing object in CDFG or a constant) may be connected to the ALU's
input. The output of the ALU corresponding to the root operator in the expression tree is
defined as the result of the RHS. Finally, a new pair {LHS, result(RHS)} may be added to
VARLIST. If the two assignments above are processed, the CDFG with two ALUs, as shown
in Fig. 55, may be created. It is noted that the input and output names can be
deduced from their position. It is further noted that the compiler front end would normally
have substituted the second assignment by b = 13 (constant propagation). For simplicity,
no frontend optimizations are considered in this and the following examples. Outputs
occurring in VARLIST are labeled by Roman numbers. After these two assignments,
VARLIST = [{b, I}, {a, 5}]. (The front of the list is on the left side.) Note that all inputs
connected to a constant (whether direct from the expression tree or retrieved from VARLIST)
must be defined as constant. Inputs defined as constants have a small c next to the input
arrow in Fig. 55.
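
The expression-tree mapping described above can be sketched in a few lines of Python. As a loudly labeled assumption, Python's own ast module stands in for the HLL frontend, which works here only because these simple assignments are spelled the same way in C and Python. The opcode names (ADD, SUB, MUL) and the output labels such as "ALU0.X" are hypothetical.

```python
import ast

OPCODES = {ast.Add: "ADD", ast.Sub: "SUB", ast.Mult: "MUL"}  # assumed names

def compile_assignment(stmt, varlist, cdfg):
    """Map 'lhs = expr' to ALU nodes in cdfg; return the updated VARLIST."""
    node = ast.parse(stmt).body[0]          # an ast.Assign node
    lhs = node.targets[0].id

    def emit(expr):
        if isinstance(expr, ast.Constant):  # leaf: constant pseudo-object
            return expr.value
        if isinstance(expr, ast.Name):      # leaf: look up VARDEF(var)
            return next(out for name, out in varlist if name == expr.id)
        left, right = emit(expr.left), emit(expr.right)
        cdfg.append((OPCODES[type(expr.op)], left, right))  # one ALU per operator
        return f"ALU{len(cdfg) - 1}.X"      # label of the new ALU's output

    return [(lhs, emit(node.value))] + varlist
```

Processing "a = 5" and then "b = a*2 + 3" leaves two ALUs in the CDFG (a multiplier fed by the constants 5 and 2, and an adder fed by the multiplier output and the constant 3), and the VARLIST holds the pair for b in front of the pair {a, 5}.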

Conditional Integer Assignments

For conditional if-then-else statements including only integer assignments, objects
for condition evaluation may be created first. The object event output indicating the
condition result may be kept for choosing the correct branch result later. Next, both
branches may be processed in parallel, using separate copies VARLIST1 and VARLIST2
of VARLIST. (VARLIST itself is not changed.) Finally, for all variables added to
VARLIST1 or VARLIST2, a new entry for VARLIST may be created (combination phase).
The valid definitions from VARLIST1 and VARLIST2 may be combined with a MUX
function, and the correct input may be selected by the condition result. For variables only
defined in one of the two branches, the multiplexer may use the result retrieved from the
original VARLIST for the other branch. If the original VARLIST does not have an entry for
this variable, a special "undefined" constant value may be used. However, in a
functionally correct program, this value will never be used. As an optimization, only
variables live (see A.V. Aho et al., supra) after the if-then-else structure need to be added
to VARLIST in the combination phase. A variable is live at a program point if its value is
read at a statement reachable from the point without intermediate redefinition.

Consider the above with respect to the following example:

i = 7;
a = 3;
if (i < 10) {
    a = 5;
    c = 7;
}
else {
    c = a - 1;
    d = 0;
}

For this example, Fig. 56 shows the resulting CDFG. Before the if-then-else construct,
VARLIST = [{a, 3}, {i, 7}]. After processing the branches, for the then branch, VARLIST1
= [{c, 7}, {a, 5}, {a, 3}, {i, 7}], and for the else branch, VARLIST2 = [{d, 0}, {c, I}, {a, 3},
{i, 7}]. After combination, VARLIST = [{d, II}, {c, III}, {a, IV}, {a, 3}, {i, 7}].
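
The combination phase can be reproduced for this example with a short Python sketch. It assumes that VARLISTs are lists of (variable, output) tuples, that a MUX is represented by a tuple ("MUX", A, B, SEL), and that changed variables are processed in alphabetical order. That ordering is chosen only so that the result matches the order of the list given above. The name combine and the label "U" for the condition's event output are illustrative.

```python
def combine(varlist, varlist1, varlist2, cond):
    """Merge the branch lists into VARLIST with one MUX per changed variable."""
    def lookup(vl, var):
        return next((out for name, out in vl if name == var), "undefined")

    n_old = len(varlist)
    # Entries added by each branch sit in front of the shared original tail.
    added1 = varlist1[:len(varlist1) - n_old]
    added2 = varlist2[:len(varlist2) - n_old]
    changed = sorted({name for name, _ in added1 + added2})
    result = list(varlist)
    for var in changed:
        # MUX copies input A for a 0-event (else branch) and B for a 1-event.
        mux = ("MUX", lookup(varlist2, var), lookup(varlist1, var), cond)
        result.insert(0, (var, mux))
    return result
```

For the example, the new entry for a selects between 3 (else branch, taken from the original VARLIST) and 5 (then branch), and the entry for d selects between 0 and the "undefined" constant.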



Note that case- or switch-statements can be processed, too, since they can be converted,
without loss of generality, to nested if-then-else statements.

Processing conditional statements this way does not require explicit control and does not
change START. Both branches may be executed in parallel and synchronized by the
dataflow. It is possible to pipeline the dataflow for optimal throughput.

General Conditional Statements

Conditional statements including either array accesses (see the discussion below under
the heading "Array Accesses") or inner loops cannot
be processed as described above under the heading "Conditional Integer
Assignments." Data packets must be sent only to the active branch. This may be
achieved by the implementation shown in Fig. 57, similar to the method presented in
S.J. Allan et al., supra.

A dataflow analysis may be performed to compute used sets use and defined sets def
(see A.V. Aho et al., supra) of both branches. A variable is used in a
statement (and hence in a program region including the statement) if its value is read. A
variable is defined in a statement (or region) if a new value is assigned to it. For the current
VARLIST entries of all variables in IN = use(thenbody) ∪ def(thenbody) ∪ use(elsebody)
∪ def(elsebody) ∪ use(header), DEMUX functions controlled by the IF condition are
inserted. It is noted that arrows with double lines in Fig. 57 denote connections for all
variables in IN, and the shadowed DEMUX function stands for several DEMUX
functions, one for each variable in IN. The DEMUX functions forward data packets only to
the selected branch. New lists VARLIST1 and VARLIST2 are compiled with the respective
outputs of these DEMUX functions. The then-branch is processed with VARLIST1, and the
else-branch with VARLIST2. Finally, the output values are combined. OUT
includes the new values for the same variables as in IN. Since only one branch is
ever activated, there will not be a conflict due to two packets arriving simultaneously. The
combinations will be added to VARLIST after the conditional statement. If the IF
execution shall be pipelined, MERGE opcodes for the output must be inserted, too. They are
controlled by the condition like the DEMUX functions.
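
The computation of IN and the insertion of the DEMUX functions can be sketched as follows. The tuple representation of a DEMUX, ("DEMUX", source, SEL), and the function names are assumptions of this sketch. Following the DEMUX definition, output Y (selected by a 1-event, i.e., a true condition) feeds the then-branch and output X feeds the else-branch.

```python
def in_set(header_use, then_use, then_def, else_use, else_def):
    """IN as the union of the use and def sets of both branches and the header."""
    return (set(then_use) | set(then_def) | set(else_use)
            | set(else_def) | set(header_use))

def insert_demuxes(varlist, in_vars, cond):
    """Build VARLIST1 (then) and VARLIST2 (else) from per-variable DEMUXes."""
    vl1, vl2 = list(varlist), list(varlist)
    for var in sorted(in_vars):
        current = [out for name, out in varlist if name == var]
        if not current:
            continue                     # only current VARLIST entries get a DEMUX
        d = ("DEMUX", current[0], cond)
        vl1.insert(0, (var, (d, "Y")))   # 1-event: packet forwarded to thenbody
        vl2.insert(0, (var, (d, "X")))   # 0-event: packet forwarded to elsebody
    return vl1, vl2
```

For a statement whose header reads i, whose then-branch defines a and c, and whose else-branch reads a and defines c and d, IN is {a, c, d, i}. With VARLIST = [{a, 3}, {i, 7}], DEMUX functions are created for a and i only.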

With respect to that which is discussed in S.J. Allan et al., supra, the following
extension, corresponding to the dashed lines of Fig. 57, may be added in an embodiment
of the present invention in order to control the execution as mentioned above with START
events. The START input may be ECOMB-combined with the condition output and connected
to the SEL input of the DEMUX functions. The START inputs of thenbody and elsebody may
be generated from the ECOMB output sent through a 1-FILTER and a 0-CONSTANT or through
a 0-FILTER, respectively. (The 0-CONSTANT may be required since START events must always
be 0-events.) The overall START_new output may be generated by a simple "2 to 1 connection"
of thenbody's and elsebody's START_new outputs. With this extension, arbitrarily nested
conditional statements or loops can be handled within thenbody and elsebody.
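
The event routing of this extension can be traced with a minimal Python sketch. Events are modelled as the integers 0 and 1, and None stands for "no event packet produced". The function names are illustrative.

```python
def ecomb(*events):
    """ECOMB: 1-event iff at least one input event is a 1-event (logical or)."""
    return int(any(events))

def route_start(start, cond):
    """Return (then_start, else_start); None means no event packet is produced."""
    e = ecomb(start, cond)                 # START is a 0-event, so e equals cond
    then_start = 0 if e == 1 else None     # 1-FILTER followed by 0-CONSTANT
    else_start = 0 if e == 0 else None     # 0-FILTER
    return then_start, else_start
```

Exactly one branch receives a START 0-event: thenbody for a true condition, elsebody otherwise.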

WHILE Loops

WHILE loops may be processed similarly to the scheme presented in
S.J. Allan et al., supra (see Fig. 58). Double line connections and
shadowed MERGE and DEMUX functions represent duplication for all variables in
IN. Here IN = use(whilebody) ∪ def(whilebody) ∪ use(header). The WHILE loop
may execute as follows. In the first loop iteration, the MERGE functions may
select all input values from VARLIST at loop entry (SEL=0). The MERGE outputs may
be connected to the header and the DEMUX functions. If the while condition is true (SEL=1),
the input values may be forwarded to the whilebody, and otherwise to OUT. The output
values of the whilebody may be fed back to whilebody's input via the MERGE and
DEMUX operators as long as the condition is true. Finally, after the last iteration, they
may be forwarded to OUT. The outputs may be added to the new VARLIST. It is
noted that the MERGE function for variables not live at the loop's beginning and the
whilebody's beginning can be removed since its output is not used. For these variables, only
the DEMUX function to output the final value is required. It is further noted that the
MERGE functions can be replaced by simple "2 to 1 connections" if the configuration
process guarantees that packets from IN1 always arrive at the DEMUX's input before
feedback values arrive.

With respect to that which is discussed in S.J. Allan et al., supra, the following two
extensions, corresponding to the dashed lines in Fig. 58, may be added in an embodiment of
the present invention:

• In S.J. Allan et al., supra, the SEL input of the MERGE functions is preloaded with
0. Thus, the loop execution begins immediately and can be executed only once.
Instead, in an embodiment of the present invention, the START input may be
connected to the MERGE's SEL input ("2 to 1 connection" with the header output). This
may allow control of the time of the start of the loop execution and may allow
its restart.

• The whilebody's START input may be connected to the header output, sent through a
1-FILTER/0-CONSTANT combination as above (generates a 0-event for each loop
iteration). By ECOMB-combining whilebody's START_new output with the header output
for the MERGE functions' SEL inputs, the next loop iteration is only started after the
previous one has finished. The while loop's START_new output is generated by filtering the
header output for a 0-event.

With these extensions, arbitrarily nested conditional statements or loops can be handled 
within whilebody. 
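
The circulation of variable values through the WHILE scheme can be illustrated behaviourally. The Python sketch below is not a structural model of Fig. 58. It treats all variables in IN as one dictionary, and header and body are hypothetical callables standing for the header and whilebody subgraphs.

```python
def run_while(values, header, body):
    """Circulate the variable packets through MERGE, header, and DEMUX."""
    sel = 0                 # first iteration: MERGE takes the loop-entry values
    feedback = None
    while True:
        current = feedback if sel else values   # MERGE, selected by SEL
        if not header(current):
            return current                      # 0-event: DEMUX forwards to OUT
        feedback = body(current)                # 1-event: DEMUX feeds whilebody
        sel = 1                                 # later iterations take feedback
```

For example, with the entry values i = 0 and s = 0, the header i < 3, and a body that adds i to s and then increments i, the values forwarded to OUT are i = 3 and s = 3.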

FOR Loops

FOR loops are particularly regular WHILE loops. Therefore, they may be
handled as explained above. However, an RDFP according to an
embodiment of the present invention may feature a special counter function CNT and a
data packet multiplication function MDATA, which can be used for a more efficient
implementation of FOR loops. This new FOR loop scheme is shown in Fig. 59.

A FOR loop may be controlled by a counter CNT. The lower bound (LB), upper bound
(UB), and increment (INC) expressions may be evaluated like any other expression (see,
for example, that which is discussed above under the heading "Integer Expressions and
Assignments," and that which is discussed below under the heading "Array Accesses")
and connected to the respective inputs.

As opposed to WHILE loops, a MERGE/DEMUX combination is only required for variables
in IN1 = def(forbody), i.e., those defined in forbody. It is noted
that the MERGE functions can be replaced by simple "2 to 1 connections" as for WHILE
loops if the configuration process guarantees that packets from IN1 always arrive at the
DEMUX's input before feedback values arrive. IN1 does not include variables which are
only used in forbody, LB, UB, or INC, and also does not include the loop index
variable. Variables in IN1 may be processed as in WHILE loops, but the MERGE and
DEMUX functions' SEL input is connected to CNT's W output. (The W output may do
the inverse of a WHILE loop's header output. It may output a 1-event after the
counter has terminated. Therefore, the inputs of the MERGE functions and the outputs of the
DEMUX functions may be swapped here, and the MERGE functions' SEL inputs may
be preloaded with 1-events.)

CNT's X output may provide the current value of the loop index variable. If the
final index value is required (live) after the FOR loop, it may be selected with a DEMUX
function controlled by CNT's U event output (which may produce one event for
every loop iteration).

Variables in IN2 = use(forbody) \ def(forbody), i.e., those defined outside the loop and only
used (but not redefined) inside the loop, may be handled differently. Unless it is a
constant value, the variable's input value (from VARLIST) must be reproduced in each loop
iteration since it is consumed in each iteration. Otherwise, the loop would stall from the
second iteration onwards. The packets may be reproduced by MDATA functions, with the
SEL inputs connected to CNT's U output. The SEL inputs must be preloaded with a 1-event
to select the first input. The 1-event provided by the last iteration may select a new
value for the next execution of the entire loop.
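
How MDATA reproduces a loop-invariant value can be traced with a short Python sketch. It assumes that the data input is a FIFO queue, that events are the integers 0 and 1, and that the SEL stream consists of the preloaded 1-event followed by CNT's U events (N-1 0-events and one 1-event for N iterations). The function name mdata is illustrative.

```python
from collections import deque

def mdata(data, sel_events):
    """MDATA: a 1-event loads the next packet from A; every event emits a copy."""
    a, out, held = deque(data), [], None
    for s in sel_events:
        if s == 1:
            if not a:
                break              # stall: no new packet available at A
            held = a.popleft()     # consume and keep the new packet
        out.append(held)           # copy of the held packet at output X
    return out
```

For a loop with N = 3 iterations, the SEL stream is 1, 0, 0, 1. With the packets 7 and 9 at input A, the value 7 is delivered three times, once per iteration, and the closing 1-event loads 9 for the next execution of the entire loop.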

The following control events (corresponding to the dotted lines in Fig. 59) are similar to
the WHILE loop extensions, but simpler. CNT's START input may be connected to the
loop's overall START signal. START_new may be generated from CNT's W output, sent
through a 1-FILTER and 0-CONSTANT. CNT's V output may produce one
0-event for each loop iteration and may therefore be used as forbody's START. Finally,
CNT's NEXT input may be connected to forbody's START_new output.
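
The event sequences that CNT contributes to this control scheme can be written out explicitly. The Python sketch below assumes a positive increment, an inclusive upper bound, and at least one iteration. None of these details is fixed by the description, and the function names are illustrative.

```python
def cnt(lb, ub, inc):
    """Return the X values and the U, V, W event sequences of one counter run."""
    xs = list(range(lb, ub + 1, inc))   # assumption: UB inclusive, inc > 0
    n = len(xs)
    u = [0] * (n - 1) + [1]             # the single 1-event marks the last value
    v = [0] * n                         # one 0-event per iteration
    w = [0] * n + [1]                   # final 1-event after termination
    return xs, u, v, w

def start_new_from_w(w):
    """1-FILTER keeps only 1-events; 0-CONSTANT turns each into a 0-event."""
    return [0 for e in w if e == 1]
```

For LB = 0, UB = 3, and INC = 1, the counter runs N = 4 times. V supplies four 0-events as forbody's START, and W yields exactly one START_new 0-event after the counter has terminated.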

For pipelined loops (as defined below under the heading "Vectorization
and Pipelining"), loop iterations may be allowed to overlap. Therefore, CNT's NEXT
input need not be connected. Now the counter may produce index variable
values and control events as fast as they can be consumed. However, in this case CNT's W
output is not sufficient as overall START_new output since the counter terminates before the
last iteration's forbody finishes. Instead, START_new may be generated from CNT's U output
ECOMB-combined with forbody's START_new output, sent through a 1-FILTER/0-CONSTANT
combination. The ECOMB may produce an event after termination of
each loop iteration, but only the last event is a 1-event because only the last output of CNT's
U output is a 1-event. Thus, this event may indicate that the last iteration has
finished. A FOR loop example compilation with and without
pipelining is provided below under the heading "More Examples."

As for WHILE loops, these methods allow arbitrarily nested loops and conditional 
statements to be processed. The following advantages over WHILE loop implementations 
may be achieved: 

• One index variable value may be generated by the CNT function each clock cycle. This 
is faster and smaller than the WHILE loop implementation, which allocates a 
MERGE/DEMUX/ADD loop and a comparator for the counter functionality. 

• Variables in IN2 (only used in forbody) may be reproduced in the special MDATA 
functions and need not go through a MERGE/DEMUX loop. This is again faster and 
smaller than the WHILE loop implementation. 

Vectorization and Pipelining 

In the embodiments described above, CDFGs are generated that perform the HLL 
program's functionality on an RDFP. However, the program execution is unduly 
sequentialized by the START signals. In some cases, innermost loops can be vectorized. 
This means that loop iterations can overlap, leading to a pipelined dataflow through the 
operators of the loop body. The Pipeline Vectorization technique (see Markus Weinhardt 
et al., "Pipeline Vectorization," supra) can be easily applied to the compilation method of 
embodiments of the present invention. As mentioned above, for FOR loops, the CNT's 
NEXT input may be removed so that CNT counts continuously, thereby overlapping the 
loop iterations. 

All loops without array accesses can be pipelined since the dataflow automatically 
synchronizes loop-carried dependencies, i.e., dependencies between a statement in one 
iteration and another statement in a subsequent iteration. Loops with array accesses can 
be pipelined if the array (i.e., RAM) accesses do not cause loop-carried dependencies or 
can be transformed to such a form. In this case, no RAM address is written in one 
iteration and read in a subsequent iteration. Therefore, the read and write accesses to the 
same RAM may overlap. This degree of freedom is exploited in the RAM access 
technique described below. Especially for dual-ported RAM, it leads to considerable 
performance improvements. 
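
The pipelining condition stated above (no RAM address written in one iteration and read in a subsequent iteration) can be checked by brute force for loops with known bounds. The following is a minimal sketch, not the compiler's actual dependence analysis; the function name and the representation of address expressions as Python callables are assumptions made for illustration.

```python
def has_loop_carried_ram_dependence(iterations, write_addr, read_addrs):
    """Return True if an address written in one iteration is read in a later one.

    write_addr(i) gives the address written in iteration i (hypothetical helper),
    read_addrs(i) gives the list of addresses read in iteration i.
    """
    written = set()
    for i in iterations:
        # Only writes from earlier iterations count as loop-carried.
        if any(addr in written for addr in read_addrs(i)):
            return True
        written.add(write_addr(i))
    return False

# a[i] = a[i-1] + 1  reads what the previous iteration wrote: not pipelinable as is.
print(has_loop_carried_ram_dependence(range(1, 11), lambda i: i, lambda i: [i - 1]))
# a[i] = a[i] * 2  touches each address in one iteration only: accesses may overlap.
print(has_loop_carried_ram_dependence(range(0, 11), lambda i: i, lambda i: [i]))
```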

Array Accesses 

In contrast to scalar variables, array accesses have to be controlled explicitly in order to 
maintain the program's correct execution order. As opposed to normal dataflow machine 
models (see A.H. Veen, supra), an RDFP does not have a single address space. Instead, 
the arrays may be allocated to several RAMs. This leads to a different approach to 
handling RAM accesses and opens up new opportunities for optimization. 



To reduce the complexity of the compilation process, array accesses may be processed in 
two phases. Phase 1 may use "pseudo-functions" for RAM read and write accesses. A 
RAM read function may have an RD data input (read address) and an OUT data output 
(read value), and a RAM write function may have WR and IN data inputs (write address 
and write value). Both functions are labeled with the array the access refers to, and both may 
have a START event input and a U event output. The events may control the access order. 
In Phase 2, all accesses to the same RAM may be combined and substituted by a single 
RAM function as shown in Fig. 1. This may involve manipulating the data and 
event inputs and outputs such that the correct execution order is maintained and the outputs 
are forwarded to the correct part of the CDFG. 

• Phase 1: 

Since arrays may be allocated to several RAMs, only accesses to the same RAM 
have to be synchronized. Accesses to different RAMs can occur concurrently or even out of 
order. In case of data dependencies, the accesses may self-synchronize automatically. 
Within pipelined loops, not even read and write accesses to the same RAM have to be 
synchronized. This may be achieved by maintaining separate START signals for every 
RAM or even separate START signals for RAM read and RAM write accesses in pipelined 
loops. At the end of a basic block, which is a program part with a single entry and a 
single exit point, i.e., a piece of straight-line code (see A.V. Aho et al., supra), all 
STARTnew outputs must be combined by an ECOMB to provide a START signal 
for the next basic block, which guarantees that all array accesses in the previous basic block 
are completed. For pipelined loops, this condition can even be relaxed. Only after the loop 
exit do all accesses have to be completed. The individual loop iterations need not be 
synchronized. 

First, the RAM addresses may be computed. The compiler frontend's standard 
transformation for array accesses can be used, and a CDFG function's output may be 
generated which may provide the address. If applicable, the offset with respect to 
the RDFP RAM (as determined in the initial mapping phase) must be added. This output 
may be connected to the pseudo RAM read's RD input (for a read access) or to the pseudo 
RAM write's WR input (for a write access). Additionally, the OUT output (read) or IN input 
(write) may be connected. The START input may be connected to the variable's START 
signal, and the U output may be used as STARTnew for the next access. 

To avoid redundant read accesses, RAM reads may also be registered in VARLIST. 
Instead of an integer variable, an array element may be used as the first element of the pair. 
However, a change in a variable occurring in an array index invalidates the information in 
VARLIST. It must then be removed from it. 
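
The registration and invalidation of array reads in VARLIST can be pictured as a small lookup table. This is only a toy model: the class, its method names, and the representation of an array element as an (array, index variables) pair are assumptions for illustration, and the second element of each pair is simply a string standing in for the CDFG output that carries the read value.

```python
class VarList:
    """Toy model of VARLIST entries for array reads (all names are illustrative)."""

    def __init__(self):
        # (array name, index variables) -> CDFG output holding the read value
        self.entries = {}

    def lookup_or_register(self, array, index_vars, make_read):
        key = (array, tuple(index_vars))
        if key not in self.entries:
            # No valid entry: a new RAM read has to be generated.
            self.entries[key] = make_read()
        return self.entries[key]

    def variable_changed(self, var):
        # A change in a variable occurring in an array index invalidates the entry.
        self.entries = {k: v for k, v in self.entries.items() if var not in k[1]}


varlist = VarList()
issued = []

def new_read():
    issued.append("RAM read")
    return "out%d" % len(issued)

first = varlist.lookup_or_register("a", ["i"], new_read)
second = varlist.lookup_or_register("a", ["i"], new_read)   # reused, no new read
varlist.variable_changed("i")                               # e.g. after i = i + 1
third = varlist.lookup_or_register("a", ["i"], new_read)    # must be read again
print(first, second, third, len(issued))
```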

The following example with two read accesses compiles to the intermediate CDFG shown in 
Fig. 60. The START signals refer only to variable a. STOP1 is the event connection 
which synchronizes the accesses. Inputs START (old), i, and j should be substituted by the 
actual outputs resulting from the program before the array reads. 

x = a[i]; 
y = a[j]; 
z = x + y; 

Fig. 61 shows the translation of the following write access: a[i] = x; 

• Phase 2: 

The pseudo-functions of all accesses to the same RAM may be merged and may be 
substituted by a single RAM function. For all data inputs (RD for read access and WR and 
IN for write access), GATEs may be inserted between the input and the RAM function. 
Their E inputs may be connected to the respective START inputs of the original 
pseudo-functions. If a RAM is read and written at only one program point, the U output 
of the read and write access may be moved to the ERD or EWR output, respectively. For 
example, the single access a[i] = x; from Fig. 61 may be transformed to the final CDFG 
shown in Fig. 62. 

However, if several read or several write accesses (i.e., pseudo-functions from different 
program points) to the same RAM occur, the ERD or EWR events are not specific anymore. 
But a STARTnew event of the original pseudo-function should only be generated for the 
respective program point, i.e., for the current access. This may be achieved by 
connecting the START signals of all other accesses (pseudo-functions) of the same type (read 
or write) with the inverted START signal of the current access. The resulting signal 
may produce an event for every access, but a 1-event only for the current access. 
This event may be ECOMB-combined with the RAM's ERD or EWR output. 
The ECOMB's output will only occur after the access is completed. Because ECOMB 
OR-combines its event packets, only the current access produces a 1-event. Next, this event 
may be filtered with a 1-FILTER and changed by a 0-CONSTANT, resulting in a STARTnew 
signal which produces a 0-event only after the current access is completed, as required. 

For several accesses, several sources may be connected to the RD, WR, and IN inputs of a 
RAM. This may disable the self-synchronization. However, since only one access 
occurs at a time, the GATEs only allow one data packet to arrive at the inputs. 

For read accesses, the packets at the OUT output face the same problem as the ERD event 
packets, which is that they occur for every read access, but must be used (and 
forwarded to subsequent operators) only for the current access. This can be achieved by 
connecting the OUT output via a DEMUX function. The Y output of the DEMUX may be 
used, and the X output may be left unconnected. Then it may act as a selective gate 
which only forwards packets if its SEL input receives a 1-event, and discards its data input if 
SEL receives a 0-event. The signal created by the ECOMB described above for the 
STARTnew signal may create a 1-event for the current access, and a 0-event otherwise. 
Using it as the SEL input achieves exactly the desired functionality. 
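
The selection mechanism for several read accesses to one RAM can be traced with a small event model. The sketch below rests on assumptions that are not stated in the specification: events are the integers 0 and 1, START signals and the ERD output carry 0-events, and one event is produced per RAM read in program order. The function names and the sample values 10 and 20 are illustrative only.

```python
def sel_events(current, access_order, erd_event=0):
    """ECOMB output seen for the access `current`, one event per RAM read."""
    events = []
    for access in access_order:
        # START signals carry 0-events; only the current access's START is
        # inverted, so the connected signal is 1 for it and 0 for all others.
        merged = 1 if access == current else 0
        # ECOMB OR-combines the merged signal with the RAM's ERD event.
        events.append(merged | erd_event)
    return events

def demux_y(out_packets, sel):
    """DEMUX with X unconnected: forward on a 1-event, discard on a 0-event."""
    return [p for p, s in zip(out_packets, sel) if s == 1]

def start_new(sel):
    """1-FILTER followed by 0-CONSTANT: one 0-event per 1-event."""
    return [0 for s in sel if s == 1]

order = ["read_x", "read_y"]   # x = a[i]; y = a[j];
out = [10, 20]                 # sample values of a[i], a[j] on the shared OUT output
x_val = demux_y(out, sel_events("read_x", order))
y_val = demux_y(out, sel_events("read_y", order))
print(x_val, y_val, start_new(sel_events("read_y", order)))
```

Each access thus sees every OUT packet but keeps only its own, and its STARTnew fires once, after its own read has completed.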

Fig. 63 shows the resulting CDFG for the first example above (two read accesses), after 
applying the transformations of Phase 2 to Fig. 60. STOP1 may be generated as 
follows: START(old) may be inverted, "2 to 1 connected" to STOP1 (because it is the 
START input of the second read pseudo-function), ECOMB-combined with RAM's ERD 
output and sent through the 1-FILTER/0-CONSTANT combination. START(new) may be 
generated similarly, but here START(old) may be directly used and STOP1 inverted. The 
GATEs for input IN (i and j) may be connected to START(old) and STOP1, respectively, 
and the DEMUX functions for outputs x and y may be connected to the ECOMB outputs 
related to STOP1 and START(new). 

Multiple write accesses may use the same control events, but instead of one GATE per access 
for the RD inputs, one GATE for WR and one GATE for IN (with the same E input) may be 
used. The EWR output may be processed like the ERD output for read accesses. 

This transformation may ensure that all RAM accesses are executed correctly, but it is 
not very fast since read or write accesses to the same RAM are not pipelined. The next 
access only starts after the previous one is completed, even if the RAM being used has 
several pipeline stages. This inefficiency can be removed as follows: 

First, continuous sequences of either read accesses or write accesses (not mixed) within a 
basic block may be detected by checking for pseudo-functions whose U output is directly 
connected to the START input of another pseudo-function of the same RAM and the same 
type (read or write). For these sequences, it is possible to stream data into the RAM rather 
than waiting for the previous access to complete. For this purpose, a combination of MERGE 
functions may select the RD or WR and IN inputs in the order given by the sequence. 
The MERGEs must be controlled by iterative ESEQs guaranteeing that the inputs are only 
forwarded in the desired order. Then only the first access in the sequence needs to be 
controlled by a GATE or GATEs. Similarly, the OUT outputs of a read access can be 
distributed more efficiently for a sequence. A combination of DEMUX functions with the 
same ESEQ control can be used. It may be most efficient to arrange the MERGE and 
DEMUX functions as balanced binary trees. 
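
The detection of continuous access sequences described above amounts to following direct U-to-START connections between pseudo-functions of the same RAM and type. The following sketch shows one way this could be done; the dictionary-based graph representation and all names are hypothetical and are not taken from the specification.

```python
def find_access_sequences(accesses, u_to_start):
    """Find chains of pseudo-functions that can be streamed into one RAM.

    accesses:   dict name -> (ram, kind), kind being "read" or "write"
    u_to_start: dict name -> name of the pseudo-function whose START input is
                directly connected to this pseudo-function's U output
    """
    def chained(a, b):
        # Same RAM and same type are required; mixed sequences do not qualify.
        return b is not None and accesses[a] == accesses[b]

    successors = {a: u_to_start.get(a) for a in accesses}
    has_predecessor = {b for a, b in successors.items() if chained(a, b)}

    sequences = []
    for a in accesses:
        if a in has_predecessor:
            continue  # not the head of a sequence
        seq = [a]
        while chained(seq[-1], successors[seq[-1]]):
            seq.append(successors[seq[-1]])
        if len(seq) > 1:
            sequences.append(seq)
    return sequences

accesses = {
    "r1": ("y", "read"), "r2": ("y", "read"),
    "w1": ("x", "write"), "r3": ("y", "read"),
}
u_to_start = {"r1": "r2", "r2": "w1"}
print(find_access_sequences(accesses, u_to_start))
```

In this example only r1 and r2 form a sequence; r2's connection to w1 crosses to a different RAM and type, and r3 stands alone as a single access.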

The STARTnew signal may be generated as follows: For a sequence of length n, the 
START signal of the entire sequence may be replicated n times by an ESEQ[00..1] function 
with the START input connected to the sequence's START. Its output may be directly "N 
to 1 connected" with the other accesses' START signal (for single accesses) or ESEQ outputs 
sent through 0-CONSTANT (for access sequences), ECOMB-connected to EWR or ERD, 
respectively, and sent through a 1-FILTER/0-CONSTANT combination, similar to the basic 
method described above. Since only the last ESEQ output is a 1-event, only the last RAM 
access generates a STARTnew as required. Alternatively, for read accesses, the generation of 
the last output can be sent through a GATE (without the E input connected), thereby 
producing a STARTnew event. 

Fig. 64 shows the optimized version of the first example (Figs. 60 and 63) 
using the ESEQ-method for generating STARTnew, and Fig. 65 shows the final CDFG of the 
following, larger example with three array reads. In this embodiment, the latter method 
for producing the STARTnew event is used. 

x = a[i]; 
y = a[j]; 
z = a[k]; 

If several read sequences or read sequences and single read accesses occur for the same 
RAM, 1-events for detecting the current accesses must be generated for sequences of read 
accesses. They are needed to separate the OUT-values relating to separate sequences. The 
ESEQ output just defined, sent through a 1-CONSTANT, may achieve this. It may 
again be "N to 1 connected" to the other accesses' START signals (for single accesses) or 
ESEQ outputs sent through 0-CONSTANT (for access sequences). The resulting event 
may be used to control a first-stage DEMUX which is inserted to select the relevant OUT 
output data packets of the sequence as described above for the basic method. A complete 
example is provided below under the heading "More Examples" with reference to 
Figs. 68 and 69. 

Input and Output Ports 

Input and output ports may be processed similar to vector accesses. A read from an input 
port is like an array read without an address. The input data packet may be sent to DEMUX 
functions which may send it to the correct subsequent operators. The STOP signal may be 
generated in the same way as described above for RAM accesses by combining the 
IMPORT's U output with the current and other START signals. 

Output ports may control the data packets by GATEs like array write accesses. The STOP 
signal may also be created as for RAM accesses. 

More Examples 

Fig. 66 shows the generated CDFG for the following for loop. 

a = b+c; 

for (i=0; i<=10; i++) { 
a = a + i ; 
x[i] = k; 

} 

In this example, IN1 = {a} and IN2 = {k}. In this regard, see Fig. 58. The MERGE 
function for variable a may be replaced by a 2:1 data connection as mentioned 
above under the heading "FOR Loops." It is noted that only 
one data packet arrives for variables b, c, and k, and one final packet is produced for a (out). 
Forbody does not use a START event since both operations (the adder and the RAM 
write) are dataflow-controlled by the counter anyway. But the RAM's EWR output may be 
the forbody's STARTnew and may be connected to CNT's NEXT input. It is noted that 
the pipelining optimization (see that which is discussed under the heading 
"Vectorization and Pipelining") was not applied here. If it is applied (which is possible for 
this loop), CNT's NEXT input is not connected. See Fig. 67. Here, the loop 
iterations overlap. STARTnew is generated from CNT's U output and forbody's STARTnew 
(i.e., RAM's EWR output), as defined at the end of the discussion under the 
heading "FOR Loops." 
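
For reference, the sequential semantics that the CDFG of this example must reproduce can be written down directly. The following sketch is not part of the compilation method; it only restates the loop in Python so that the expected final packet for a (out) and the RAM contents can be checked against either the non-pipelined or the pipelined CDFG. The function name and the sample input values are assumptions.

```python
def for_loop_reference(b, c, k):
    """Sequential reference semantics of the FOR loop example."""
    a = b + c
    x = {}                      # models the RAM holding array x
    for i in range(0, 11):      # i = 0 .. 10 inclusive, as in i <= 10
        a = a + i
        x[i] = k
    return a, x

a_out, x_ram = for_loop_reference(b=1, c=2, k=7)
# One final packet for a: b + c plus the sum 0 + 1 + ... + 10 = 55.
print(a_out, len(x_ram))
```

The counter produces eleven index values, so eleven write accesses reach the RAM, while only a single packet each is consumed for b, c, and k.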

The following program includes a vectorizable (pipelined) loop with one write access 
to array (RAM) x and a sequence of two read accesses to array (RAM) y. After the loop, 
another single read access to y occurs. 

z = 0; 

for (i=0; i<=10; i++) { 
x[i] = i; 

z = z + y [i] + y [2*i] ; 

} 

a = y [k] ; 

Fig. 68 shows the intermediate CDFG generated before the array access Phase 2 
transformation is applied. The pipelined loop may be controlled as follows: Within the 
loop, separate START signals for write accesses to x and read accesses to y may be used. 
The reentry to the forbody may also be controlled by two independent signals 
("cycle1" and "cycle2"). For the read accesses, "cycle2" may guarantee that the 
read y accesses occur in the correct order. But the beginning of an iteration for read y and 
write x accesses is not synchronized. Only at loop exit must all accesses be finished, which 
may be guaranteed by signal "loop finished". The single read access may be completely 
independent of the loop. 

Fig. 69 shows the final CDFG after Phase 2. It is noted that "cycle1" is removed 
since a single write access needs no additional control, and "cycle2" is removed since the 
inserted MERGE and DEMUX functions automatically guarantee the correct execution order. 
The read y accesses are not independent anymore since they all refer to the same RAM, and 
the functions have been merged. ESEQs have been allocated to control the MERGE and 
DEMUX functions of the read sequence, and for the first-stage DEMUX functions which 
separate the read OUT values for the read sequence and for the final single read access. The 
ECOMBs, 1-FILTERs, 0-CONSTANTs and 1-CONSTANTs are allocated as described 
with respect to Phase 2 under the heading "Array Accesses" to 
generate correct control events for the GATEs and DEMUX functions. 

ABSTRACT 

In a data-processing method, first result data may be obtained using a plurality of 
configurable coarse-granular elements, the first result data may be written into a memory that 
includes spatially separate first and second memory areas and that is connected via a bus to 
the plurality of configurable coarse-granular elements, the first result data may be 
subsequently read out from the memory, and the first result data may be subsequently 
processed using the plurality of configurable coarse-granular elements. In a first 
configuration, the first memory area may be configured as a write memory, and the second 
memory area may be configured as a read memory. Subsequent to writing to and reading 
from the memory in accordance with the first configuration, the first memory area may be 
configured as a read memory, and the second memory area may be configured as a write 
memory. 